ABSTRACT

A set of pilot performance reference scales was
developed based upon airborne Audio-Video Recording (AVR) of student
performance in T-37 Undergraduate Pilot Training. After selection of
the training maneuvers to be studied, video tape recordings of the
maneuvers were selected from video tape recordings already available
from a previous research effort. Those discriminable performance
events which could be observed using the video tapes were defined,
and preliminary performance scales were developed to evaluate the
video version of student performance. Through assessment and refining
of the preliminary scales, the final pilot performance reference
scales were developed. These scales were used by experienced
instructor pilots to evaluate the performances shown, and results of
these evaluations were analyzed. The study indicated that (a)
audio-video recordings of in-flight performance can serve as the
basis for the efficient development of pilot performance reference
scales; and (b) video tapes can provide sufficient information for
performance evaluation purposes. (Author/AG)
ABSTRACT



This report describes the results of a study to develop pilot performance reference
scales based upon audio-video recordings of in-flight performances of students undergoing
T-37 undergraduate pilot training. The study included scale development as well as the
test and evaluation of each scale. All the maneuvers contained on the in-flight recordings
were analyzed, and constituent performance elements observable on the video replay
were identified. Three maneuvers, Final Turn to Landing, Vertical S “A”, and Lazy Eight,
were selected for the final scaling effort. Ten performance elements each were identified
for the Lazy Eight and Vertical S “A” maneuvers, and twelve elements for the Final Turn
to Landing. A performance reference scale was developed for each maneuver. Each scale
consisted of a series of subscales for rating performance on each of the elements of the
maneuver and an additional subscale for rating the overall performance of the maneuver.
Although some elements were common to more than one maneuver, the rating scales for
these elements were tailored in each case to the maneuver involved. Each subscale
consisted of a ten-point rating line (a row of ten boxes) representing the full range of
performance from “unsatisfactory” to “excellent” and, beneath, four graded
verbalizations describing different levels of performance. No verbalizations were
presented, however, with the subscale used for rating overall performance. Final versions
of the scales were subjected to a test and evaluation through their utilization by
experienced instructor pilots. These pilots assigned levels of performance based upon
what they observed on video replays of selected maneuver examples. The results showed
the overall reliability of scales for the three maneuvers was high but that the majority of
the individual element scales were of a relatively low to medium degree of reliability. The
results are believed to justify more in-depth analysis of the data and continued
development efforts to refine and increase the scope of scale application.



SUMMARY 



Horner, W.R., Radinsky, T.L., & Fitzpatrick, R. The development, test, and evaluation of three pilot
performance reference scales. AFHRL-TR-70-22. Williams AFB, Ariz: Flying Training Division, Air
Force Human Resources Laboratory, August 1970.

Problem

Recent emphasis has been given to the experimental evaluation of airborne Audio-Video Recording
(AVR) as a technique for enhancing the training effectiveness of various Air Force flying training programs.
A first study showed significant training gains when airborne AVR was used to supplement Undergraduate
Pilot Training in the T-37 phase of training. Subsequent studies are now underway to further define the
training value of airborne AVR when used for (a) gunnery training in Combat Crew Training Schools; (b)
Pilot Instructor Training; and (c) as an aid to gunnery training by the use of AVR through A7D Head-Up
Display.

Because airborne AVR appears to offer significant training advantages through its ability to provide 
rapid knowledge of results of student performance, it is also appropriate to consider the potential of 
airborne AVR as a source for student performance evaluation. The possible value of AVR in this regard is 
twofold: (a) as a tool for the initial development of improved performance evaluation scales and (b) as the
prime source of student performance against which the performance scales are applied. The present study
represents a first effort to quantify the value of airborne AVR as a tool for scale development as well as a
source of student performance for subsequent evaluation. 

Approach 

A set of pilot performance reference scales was developed based upon airborne AVR of student 
performance in T-37 Undergraduate Pilot Training. After selection of the training maneuvers to be studied, 
video tape recordings of the maneuvers were selected from video tape recordings already available from a 
previous research effort. Those discriminable performance events which could be observed using the video 
tapes were defined, and preliminary performance scales were developed to evaluate the video version of 
student performance. Through assessment and refining of the preliminary scales, the final pilot performance 
reference scales were developed. These scales were used by experienced instructor pilots to evaluate the 
performances shown, and results of these evaluations were analyzed. 

Results 

As a result of the analysis, three UPT Syllabus maneuvers were chosen as the basis for scale 
development: Final Turn to Landing, Lazy Eight, and Vertical S “A”. The subsequent scale development 
was highlighted by the following results: (a) Inconsistent and unpredictable switching between inside and
outside video views often eliminated critical performance information. (b) Resolution of video replay was
often less than desired. (c) The Lazy Eight maneuver (with many outside scenes) was more difficult to score
than the basic instrument maneuver of Vertical S “A”. (d) All intervals of a 10-interval scale were used by
instructors with low variability when applied shortly after video replay of performance. (e) Instructor use
of scales showed high agreement for exemplary performance, but greater variability for poor performance.
(f) The increased sensitivity of scales identified student problem areas more effectively than operational
performance measures. (g) Instructors showed high agreement as to which task elements could be measured
from VTR. (h) Between-group mean reliability was high with experimental scales.

Conclusions

Even though the pilot performance reference scales developed under this program were relatively 
cumbersome, and not immediately adaptable to operational use, the study did demonstrate that (a) 
audio-video recordings of in-flight performance can serve as the basis for the efficient development of pilot 
performance reference scales; and (b) video tapes can provide sufficient information for performance
evaluation purposes. As audio-video recording of various in-flight maneuvers continues to grow as a 
function of improved AVR equipments and increased utilization as a training aid, efforts should be 
continued to fully utilize airborne AVR as a performance measurement device. This is particularly true in 
those instances where the recorded visual field contains most of the information required for evaluation 
purposes, such as instrument flight, gun-sight, and head-up display.

This summary was prepared by Milton E. Wood, Flying Training Division, Air Force Human
Resources Laboratory.

THE DEVELOPMENT, TEST, AND EVALUATION OF THREE PILOT
PERFORMANCE REFERENCE SCALES 



I. INTRODUCTION 
Background 

This report describes the development, test, and 
evaluation of three pilot performance reference 
scales based on video-tape recordings of in-flight 
performance during Air Force Undergraduate Pilot 
Training (UPT). Each of the scales refers to a 
specific maneuver and consists of a series of
subscales in which varying levels of performance
are distinguished for each of the maneuver 
elements, along with an overall summary scale. A 
series of video-tape recordings was also produced 
containing illustrative examples of performance 
levels for each of the three maneuvers. 

This project was made feasible because audio- 
video recordings were being made of student pilot 
performance during the UPT program at Vance Air 
Force Base under a separate contract. This was 
being done as an experimental evaluation of the 
usefulness of in-flight performance recordings in 
certain aspects of the pilot training curriculum 
(Neese, 1968; Purifoy, 1968; Schumacher, Rudov, 
& Valverde, 1969). 

The availability of video-tape recordings of 
student pilot performance represented an oppor- 
tunity for research. Normally, the instructor pilot 
(IP) is the only observer of a student’s perform- 
ance. He sees the performance fleetingly, while at 
the same time he is coaching the student and 
scanning for competing traffic. His evaluation of 
the performance may not always be highly 
accurate and reliable because of these other 
preoccupations, the complexity of the task of 
flight training, or the use of other unrelated 
factors to arrive at a recorded grade. There is 
normally no way to determine the accuracy and 
reliability of his evaluations. Hence, efforts to 
carry out research aimed at improving training are 
hampered by the lack of any opportunity for 
comparing evaluations. 

The availability of video recordings makes it 
possible for more than one instructor to observe 
and evaluate a given student performance, and to 
do this on more than one occasion. Thus, it 
becomes possible to compare evaluations and to 
take steps, if necessary, to improve them. The 
standardization of instructor judgments might 
then be furthered. Standard video tapes illustrating
varying levels of student performance could be
used to advantage in the training of instructors and 
check-pilots. In the long run, through the use and 
study of video tapes, it should be possible to 
develop more objective, and perhaps even 
automatic, methods of evaluating pilot per- 
formance. 

The video-tape recordings, of course, do not 
contain all the cues which the instructor pilot may 
use in actual flight. This is especially true at the 
present level of development of the recording 
equipment. A major limitation, for example, is 
that at any given time the instructor can activate 
only one of two cameras, one aimed to get a view
outside the airplane through the windscreen and 
the other focused inside on the primary flight 
instruments. However, even if considerable 
improvement were made in the video aspects of 
the system, it could still not represent such 
sensory information as that gained through 
kinesthesis. Hence, one of the questions to be
answered is whether or not enough cues are 
represented with enough fidelity on the tapes to 
support accurate and reliable judgments of 
performance. This is a matter which can usefully 
be studied. It will be of particular importance to 
study if the Air Force determines that a system 
such as the audio-video recording system should be 
adopted for regular use in training. 

In any case, an essential first step in the study 
of the video recordings was to develop scale 
descriptions and evaluation procedures for a 
sample of maneuvers to establish the feasibility of 
using this type of recording in further research and 
training. With such scales and procedures, it could 
then be determined whether instructor pilots can 
evaluate performance appropriately and con- 
sistently from observation of video recordings. 

Purpose and Scope 

The purpose of this study was to develop a 
limited number of pilot performance reference 
scales, by means of which the performances 
represented in the video recordings could be 
judged. No more than six nor less than three of the 
maneuvers listed on Air Training Command (ATC) 
Forms 872 and 877 check grade sheets were to be 
selected for final scaling.

No particular form of the scales was specified in
advance. A possible approach to the problem 
would have been to use the grading system cur- 
rently specified by Air Training Command. This 
system is a 4-point scale (U for Unable to accom- 
plish, F for Fair, G for Good, and E for Excellent) 
in which the points are defined generally, rather 
than separately and specifically for each maneuver. 
The instructors are, of course, familiar with this 
scale. However, with such general definition and 
broad application, one cannot be absolutely 
certain that the points on the scale have the same 
meaning from instructor to instructor and, for a 
given maneuver, what elements enter into the 
assignment of the grade. Hence, it was determined 
that the effort should aim at scales containing
more than four points and should involve analysis 
of each maneuver so that each scale could refer 
specifically to that maneuver and essential 
elements in the performance of that maneuver. 

Preliminary Considerations 

The development of any scale requires that a 
variety of factors be taken into consideration. 
Some of these factors relate to practical problems, 
such as the constraints imposed by the nature of 
the stimulus material, while others are more 
theoretical in nature, such as the interval prop- 
erties desired for the scale or whether to take a 
multidimensional or unidimensional approach.
Two of the factors which influenced the
development of the present scales are discussed
in this section: (a) whether a multidimensional
or unidimensional approach would be best in the
present case; and (b) the type of interval
properties desired in the scales.

Dimensionality of Scales 

Fundamental to scaling is an initial considera- 
tion as to whether to use the multidimensional or 
unidimensional approach. In the present case, it 
was decided to take an approach which is not 
precisely one or the other, but which is more 
multidimensional than unidimensional. The ration- 
ale for this decision and the apparent advantages 
and disadvantages of each approach as it relates to 
the present project are discussed. 

The unidimensional approach. In a unidimen- 
sional approach it is assumed that a single 
dimension underlies the set of stimuli to be scaled 
and that judges are capable of discriminating the
stimuli along this dimension. In the present case,
such a dimension might be designated “goodness 
of pilot performance,” with very poor perform- 
ance at one extreme and very good performance at 
the other. 

The primary advantage of the unidimensional
approach is that it is easier to conceptualize than 
the multidimensional approach. Consequently, the 
experimental design and data analysis could be 
more easily prepared. Secondly, there is greater 
efficiency in the use of a rater’s time, primarily 
because a rater need make only one judgment per 
stimulus with a unidimensional scale but must 
make several judgments per stimulus with multi- 
dimensional scales. Thirdly, most previous scaling 
work has been unidimensional, thereby providing 
more reference material. 

The major disadvantage in using the unidimensional
approach is that it produces little information
about the nature of the stimuli and processes
of judgment. In the present instance, with a 
unidimensional approach, little would have been 
learned about the cues to which instructors 
actually respond or how they integrate informa- 
tion from several cues to arrive at an overall 
performance grade. 

The multidimensional approach. Judgments 
about pilot performance are very complex and are 
made up of more than one dimension. Since it was 
not feasible to establish, empirically, the number 
and type of dimensions underlying the set of 
stimuli to be scaled, the approach taken in this 
study was to make a priori evaluations concerning 
these dimensions. Specifically, pilot performances 
were separated into several performance elements. 
Conceptually, each performance element was 
considered to be a dimension. 

The principal advantage of the multidimen- 
sional approach in this study was that it required a 
determination and analysis of the components of
the complex stimulus dimension of pilot perform- 
ance. Dividing a pilot performance into perform- 
ance elements provides a greater opportunity to 
determine how instructors attend to these 
elements and how they integrate information 
about the different elements when making an 
overall evaluation of a pilot’s performance. 
Another advantage is that multidimensional 
performance element scales can readily be adapted 
to new maneuvers since these maneuvers would be 
made up, at least in large part, of performance 
elements already identified. A final advantage is 
one of an applied nature. Separate evaluations of 
each performance element would permit easy
identification of the performance elements with
which a student is having difficulty. For example, 
under the current grading system, two student 
pilots may be graded as performing a given maneu- 
ver in a “fair” manner. Yet they may be commit- 
ting entirely different errors. There is no means for 
determining from the grade what these errors are
or why the maneuver was graded “fair” (unless, of 
course, the instructor makes a written comment in 
the “Remarks” section of the grade sheet). Fur- 
thermore, it is not apparent from the grade what 
specific problem a student is having with a maneu- 
ver from day to day or, even, whether it is the 
same problem. This observation also holds true 
across maneuvers where there are identical skills
being learned (or not learned). For example,
“effective¹ use of the power control” is one of the
basic skills required of a pilot. This skill is one of
those being taught by at least two of the maneuvers
selected for this study: Vertical S “A” (VSA)
and Final Turn to Landing (FTL). There is no way 
for an instructor to grade this skill on the current 
grade sheets. It appears that performance grading 
should be related to the particular skills being 
learned. Therefore, the effort at scale development 
was oriented toward this concept. 

The disadvantage to use of the multidimen- 
sional approach in this study lay in the difficulty 
of satisfying the requirements for dimensional 
independence and dimensional weighting. 

Ideally, the dimensions of a set of stimuli 
should be independent of one another. That is, a 
high grade on one performance element should not 
necessitate or be constantly associated with a 
particular grade level assignment on another 
dimension. The pilot performance elements which 
have been identified are not all completely 
independent; however, every attempt was made to 
reduce the number and degree of such depend- 
encies to a minimum. 

Although not explicitly dealt with in this study, 
weights should be assigned to every dimension, 
since each dimension is not necessarily equally 
important. In the present case where the dimen- 
sions consist of pilot performance elements, it was 
recognized that each performance element is not 
equally critical to the successful completion of a 
maneuver nor should they all be graded as equals. 
It was not obvious, though, how to weigh each 
performance element in precisely the correct way. 
Since the issue of validity was not to be tested for
the performance scales in this study, the concept
of criticality (or relative weight assignments) was
used primarily as one of the inputs for decisions as
to whether or not to retain an element as one of
the group of performance elements to be graded or
considered in the scale.

¹ Definition of effective is not material to the
discussion.
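
The study did not assign weights to the performance elements, so the following is only a minimal sketch of what relative weighting of element ratings could look like. The element names other than power control, the ratings, the weights, and the function name `weighted_overall` are all hypothetical and are not taken from the report.

```python
def weighted_overall(ratings, weights):
    """Combine element ratings into one score using relative weights.

    Both arguments map an element name to a number. The result is the
    weighted mean of the ratings, so it stays on the same 1-10 range.
    """
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same elements")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(ratings[name] * weights[name] for name in ratings) / total_weight


# Hypothetical ratings on a ten-point line and hypothetical weights,
# chosen only to illustrate the arithmetic.
ratings = {"airspeed control": 8, "power control": 4, "bank control": 6}
weights = {"airspeed control": 2.0, "power control": 1.0, "bank control": 1.0}

overall = weighted_overall(ratings, weights)        # (16 + 4 + 6) / 4
unweighted = sum(ratings.values()) / len(ratings)   # 18 / 3

print(overall)     # 6.5
print(unweighted)  # 6.0
```

With these illustrative numbers, weighting the first element more heavily moves the combined score from 6.0 to 6.5. The result depends on the choice of weights, which the study left open.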

Scale Intervals 

Scales may be ordinal, interval, or ratio. It was 
considered important that the scales to be devel- 
oped in this project should be at least ordinal in 
order to have any real value in the practical 
application and evaluation of student performance 
by instructors. An interval scale was, of course,
considered preferable. 

It is pertinent to note here that the current 
scales, as reflected by ATC Forms 872 and 877,
appear, at least in their descriptions, to be an 
admixture of the three scale types. The U, F, G, 
and E discriminations appear to be at equal 
intervals along a linear scale but, according to the 
numerical values indicated opposite each maneu- 
ver, the intervals are not always equal (e.g., the 
Lazy Eight values are 0 for U, 32 for F, 36 for G, 
and 40 for E). It is not known whether instructors 
actually use this scale in an equal or unequal 
interval fashion. Also, only the upper end of the 
scale is, to some degree, anchored. This anchor is 
the “ideal” or “perfect” maneuver performance. 
The perfect maneuver is described in official Air 
Force documents, but what constitutes acceptable 
performance variations to remain within the 
perfect (or E) envelope is not specified. Variations 
from the perfect performance for guidance in 
grading a performance less than E, as a G, F, or U, 
are not specified either. Grade assignments are 
made through instructor judgments based upon 
what he has learned at the instructor’s school, his 
experience as an instructor, and interaction with 
other instructors and check-pilots. 
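
The unequal spacing of the Lazy Eight point values quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the variable and function names are assumptions, and only the four point values (0, 32, 36, 40) come from the text.

```python
# Point values listed opposite the Lazy Eight on the check grade sheet,
# as quoted in the text (U, F, G, E).
lazy_eight_points = {"U": 0, "F": 32, "G": 36, "E": 40}


def step_sizes(points):
    """Return the point difference between each pair of adjacent grades."""
    grades = list(points)  # dicts keep insertion order: U, F, G, E
    return {
        f"{lower}->{upper}": points[upper] - points[lower]
        for lower, upper in zip(grades, grades[1:])
    }


steps = step_sizes(lazy_eight_points)
print(steps)  # {'U->F': 32, 'F->G': 4, 'G->E': 4}

# An equal-interval scale would show the same step between all adjacent grades.
is_equal_interval = len(set(steps.values())) == 1
print(is_equal_interval)  # False
```

The step from U to F is eight times the step between the other adjacent grades, which is the inequality the paragraph above refers to.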



II. DEVELOPMENT OF SCALES 

This section describes the steps taken to 
develop the scales reported in this study. Three 
concepts were used as guidelines during scale
development. 

1. The scales should be usable by instructor 
pilots in an operational atmosphere. Although it 
was not the purpose or intent of this study to 
develop fully an operationally usable set of scales, 
it was felt that the final product, with modifica- 
tions, should be adaptable to such an environment. 

2. The scales should be as objective as possible.
It was believed essential to reduce the number
of subjective judgments currently required of
check-pilots and instructor pilots in grading the 
progress or performance of other pilots (student or 
rated). 

3. The scales should reflect the complexity of 
the pilot’s job. Some simplification was no doubt 
necessary for practicality, but oversimplification 
was to be avoided. 

Selection of Maneuvers 

As stated previously, pilot performance refer- 
ence scales were to be developed for not more 
than six nor less than three maneuvers. A maneu- 
ver was defined as any one of the line items listed 
on a T-37 Instrument Check Grade Sheet (ATC 
Form 877, February 1969) and T-37 Contact 
Check Grade Sheet (ATC Form 872, July 1968). 
The maneuvers were selected by the project 
monitor in consultation with, and on the basis of 
recommendations by, the study team. 

The first step was to select a set of maneuvers 
which were reasonably representative of the range 
of flying situations in undergraduate pilot training 
and which were likely to be amenable to analysis 
for scaling purposes. Another criterion for selec- 
tion was that there be a sufficient number of 
examples of each maneuver contained on video 
tapes for analysis and the conduct of reliability 
tests. 

The following six maneuvers were selected 
initially: Normal Pitchout, Final Turn to Landing, 
Slow Flight, Lazy Eight, Barrel Roll, and Vertical 
S (“A” and “D” versions only). These maneuvers 
are defined in ATC Manual 514 except for two 
minor modifications. The entry on ATC Form 872 
states “Normal Pattern and Pitchout.” The 
“Normal Pitchout” maneuver used in this study 
was the pitchout only, from the instant just prior
to the point of pitchout through roll-out on to the 
downwind leg. Also, the study maneuver “Final 
Turn to Landing” actually is listed as three items 
on ATC Form 872: Normal Final Turn, Normal 
Final Approach, and Normal Touchdown. The 
only modification made in combining these three 
items was that the study maneuver is completed as 
the aircraft touches down, the roll-out after touch- 
down being excluded. 

A seventh maneuver was added at a later date. 
This was the complete Normal Landing Pattern 
from pitchout to touchdown, as previously 
defined; i.e., the downwind leg portion of the
Normal Landing Pattern was added to make the
entire maneuver a logical progression throughout. 

A reduction in the number of selected maneu- 
vers was effected during the initial phases of scale 
development. Video replays of all study maneuvers 
were thoroughly reviewed in order to establish a 
development base for the scales. As a result, the 
Barrel Roll and Slow Flight maneuvers were 
removed from consideration because neither 
contained a sufficient number of discriminable 
performance measures, observable on video replay, 
upon which to base the development of scales.

A further reduction of the set of maneuvers was 
effected following a preliminary test and evalua- 
tion of a set of interim scales. Results of this 
testing showed that the operational commitments 
of the instructor pilots, the limited time during 
which instructor pilots could realistically be 
expected to participate in the study, and the 
requirements for a statistically reliable base for 
determining the reliability of these scales necessi- 
tated the removal of the total Normal Landing 
Pattern and the Pitchout from the set of maneu- 
vers. The feasibility of deleting these two maneu- 
vers was enhanced by the fact that the three 
remaining maneuvers provided one example each 
of the low-level, high-level, and instrument phases 
of the syllabus. The maneuvers upon which the 
final performance reference scales were developed 
thus consisted of the Final Turn to Landing 
(FTL), the Lazy Eight (L8), and the Vertical S “A”
(VSA).

Collection of Data Base 

The collection of the data base consisted of two 
primary steps: (a) selection and transcription of 
examples of each of the selected maneuvers from 
the original audio-video tapes, and (b) analysis of
performance elements to serve as a framework 
upon which the scales were to be developed.

Audio-Video Tapes 

The original audio-video recordings of in-flight 
performances were made during Phase III of 
Contract F33615-68-C-1048 at Vance Air Force 
Base. These recordings were of actual student 
performances throughout the T-37 contact and 
instrument phases of undergraduate pilot training 
and were contained on V6-inch video tapes. The 
video playback units associated with the Vi-inch 
system do not have slow-motion or stop-action 
features. The development and use of pilot
performance reference scales made such features
mandatory. Therefore, it was necessary to have a 
recorder which had these features and was fully 
compatible with the equipment provided under 
Contract F33615-68-C-1048. The result was the 
purchase of a SONY EU-210/VTE-4 1-inch 
recorder and support equipment. Examples of the 
selected maneuvers contained on the '/4-inch tapes 
were transcribed and organized onto maneuver- 
specific 1-inch tapes. The audio portion of the 
'/4-inch tape was also recorded onto the 1-inch 
tape, and the student’s name, instructor, date, and 
overall grade assigned were recorded on a second 
(special) audio track. 

Even though recordings were made of student 
performances throughout their T-37 training 
(contact and instrument only), it was felt that 
students might not achieve the highest levels of 
proficiency during this time. In order to assure 
that there would be examples of each of the 
selected maneuvers representing high levels of 
performance (i.e., ideal or perfect performance), 
special '/4-inch in-flight recordings flown by expert 
pilots (i.e., instructor pilots) were also collected. 
These examples were, as with student perform- 
ances, re-recorded on the appropriate 1-inch tape 
and thus became part of the data base. The 
inventory of the number of examples of each 
maneuver contained on 1-inch video tape is as 
follows: 



Maneuver                    Number of Examples

Pitchout                    60
Final Turn to Landing       94
Normal Landing Pattern      17
Lazy Eight                  33
Vertical S (“A” and “D”)    70
Slow Flight (dropped)       30 (collection not completed)
Barrel Roll (dropped)       26 (collection not completed)


Each example was further identified in terms of
its location on the reel (using counters provided),
whether the maneuver was flown by a student or 
instructor, preliminary remarks as to the pilot 
performance deficiencies illustrated on the video 
recording, the quality of the recording, and an 
estimated grade (U, F, G, or E) for the particular 
example. The latter grade was assigned by the 
project staff purely for use as a guideline to indi- 
cate relative quality of pilot performance for fast 
retrieval of examples of maneuvers at different 
performance levels. 



Performance Elements 

Another input into the data base from which 
the pilot performance reference scales were de- 
veloped was the formulation of performance 
elements. These elements are descriptions of
segments, activities, conditions, or skill require- 
ments which, when totaled, describe a maneuver. 
As will be seen, these elements were initially 
developed for each of the study maneuvers and 
then refined into a set of elements each of which 
could be applicable to more than one of the 
selected maneuvers. 

Appendix I is a table of the initial performance 
elements developed so as to gain greater insights 
into performance requirements of the maneuvers 
to be scaled. The elements are maneuver-oriented; 
that is, each maneuver was analyzed to determine 
the performance elements applicable to that 
maneuver. A better approach to developing 
performance elements might have been to start 
with the skills required of an Air Force pilot based 
upon task analyses and then to relate the skills to 
be learned to the maneuvers which have been 
included in the current undergraduate pilot train- 
ing syllabus to teach these skills and which are of 
concern to this study. However, the scope of the 
study did not permit such an approach. 

The sources of information which provided the 
basis for Appendix I were as follows: 

ATC Manual 514, 12 June 1967. 

RAFB Student Study Guide Fl 115070-5, June 
1967. 

Written comments from instructors which 
included common student errors. 

Verbal comments from a panel of instructors 
from Vance Air Force Base. 

(These instructors are not only highly 
qualified as flight instructors but all have 
had some degree of experience with the 
Audio-Video Recording System.) 

Instructor comments contained in the “Re- 
marks” section of completed T-37 Contact 
and Instrument Check Grade Sheets. 

Study staff experience with these maneuvers. 

Each maneuver is represented by named 
segments. That is, there is a general or overall 
segment which presents an overview of the 
maneuver followed by logical groups within 
sequential blocks of time which, together, make 
up the total maneuver. The L8 maneuver, for 
example, is made up of nine groups: overall and
eight logical checkpoints (45°, 90°, 135°, 180°,
225°, 270°, 315°, and 360°) throughout the maneuver. It
is recognized that other breakdowns or groupings
of maneuver performance measures could have 
been made. The column headings are explained as
follows: 

“Activity or Condition.” The items listed under
this heading are concerned with what is expected 
of the pilot during that particular segment of the 
maneuver and what the initial or ending condition 
or result should be. For example, there arc given 
conditions which must exist before a pilot starts 
an L8 maneuver. These arc given in tire “Overall” 
section. Then, too, at the 45° point there arc 
certain conditions which arc considered ideal for 
proper performance of the maneuver. Also listed 
in this column are the activities which contribute 
to ideal maneuver performance. 

"Indicator or Sense.” The items under this 
column include the instrument or sense which is 
used by an instructor to make judgments about 
the performance of the activity or condition. One 
item, “feel,” requires some explanation. It is 
meant to convey that accumulation of all the cues 
a pilot receives from his environment, instruments, 
and senses which result in a greater awareness of 
the performance or judgments as to the degree of 
“goodness” of performance. “Feel” is also used in 
a simpler context such as a requirement to touch 
the landing gear lever during a “gear down and 
locked” check to ascertain that one condition of 
the landing gear being in the extended position is 
met. 

"Indications or Stimulus.” The items listed in 
this column reflect the manifestation of the 
activity or condition. 

"Decision Factors.” These items are, as the 
name implies, factors which must be considered, at 
a minimum, by the instructors who grade the 
performance of the given activity or condition. 

“Performance Criteria.” The items in this 
column were originally intended to convey the 
performance parameters which would indicate the 
degree of “goodness” of an activity or condition. 
As stated previously, there are no such criteria
known to exist officially at this time and, therefore,
the item is either listed as “none” or “IP
judgment.” The latter is meant to indicate that the
basis for judging the degree of “goodness” of that
particular facet of student performance is left to
the discretion of the instructor pilot. The item
“none” implies just that: there is no criterion to
define the degree of “goodness.” For example, in
the setup for the L8, the manual calls for 200 
knots as the starting airspeed. What if a pilot starts 
the maneuver at some speed other than 200 knots? 



Under the current grading system, at what airspeed
other than 200 knots does his performance
become G, F, or U? What are the airspeed limits
for an E performance? This is, obviously, a simple
example; it becomes much more complex consid- 
ering a maneuver as a whole or a complicated seg- 
ment of a total maneuver, such as the final 
approach to a landing. 

"TV System Capability.” The items “Yes” and 
“No” indicate that the performance of a given 
activity or condition can or cannot be observed to 
some degree through use of an audio-video record- 
ing system. 

"Criticality of Performance.” The items in this 
final column were preliminary judgments made by 
the project staff as to the relative importance or 
contribution of the performance of a given activity 
or condition to the overall performance of a per- 
fect maneuver. It is a simple scale of 3 for ex- 
tremely important, 2 for moderately important, 
and 1 for minor importance. The information in
this column was used as a guideline during the 
development of the scales as an indication of prior- 
ity. Appendix I does not include the Barrel Roll or 
Slow Flight maneuvers which had been, as previ- 
ously stated, dropped from study consideration. 
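
The column structure described above can be pictured as a simple record type. The following is a minimal sketch only, assuming one record per Appendix I row; the class name, field names, and the sample rows are hypothetical illustrations (only the 200-knot entry airspeed for the L8 is taken from the text), and Appendix I itself is not reproduced here.

```python
from dataclasses import dataclass

# Criticality codes as described in the text: 3, 2, and 1.
CRITICALITY_LABELS = {
    3: "extremely important",
    2: "moderately important",
    1: "minor importance",
}


@dataclass(frozen=True)
class AppendixEntry:
    """One row of an Appendix I style breakdown (field names are illustrative)."""
    maneuver: str
    segment: str                 # "Overall" or a checkpoint such as "45°"
    activity_or_condition: str
    indicator_or_sense: str
    indications_or_stimulus: str
    decision_factors: str
    performance_criteria: str    # "none" or "IP judgment" in the source
    tv_system_capability: bool   # the "Yes"/"No" column
    criticality: int             # 3, 2, or 1

    def __post_init__(self) -> None:
        if self.criticality not in CRITICALITY_LABELS:
            raise ValueError(f"criticality must be 1, 2, or 3, got {self.criticality}")


def observable_by_priority(entries):
    """Keep entries the TV system can show, most critical first."""
    visible = [e for e in entries if e.tv_system_capability]
    return sorted(visible, key=lambda e: e.criticality, reverse=True)


# Hypothetical rows; only the 200-knot entry airspeed comes from the text.
EXAMPLE_ENTRIES = [
    AppendixEntry("Lazy Eight", "45°", "Illustrative condition not visible on tape",
                  "feel", "illustrative", "illustrative", "none", False, 2),
    AppendixEntry("Lazy Eight", "90°", "Illustrative minor condition",
                  "attitude indicator", "illustrative", "illustrative",
                  "IP judgment", True, 1),
    AppendixEntry("Lazy Eight", "Overall", "Entry airspeed of 200 knots",
                  "airspeed indicator", "indicated airspeed",
                  "deviation from 200 knots", "IP judgment", True, 3),
]

if __name__ == "__main__":
    for entry in observable_by_priority(EXAMPLE_ENTRIES):
        print(entry.segment, "-", CRITICALITY_LABELS[entry.criticality])
```

Sorting the observable rows by criticality mirrors the stated use of the last column as an indication of priority during scale development.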

The content of Table 1 was prepared from the
data contained in Appendix I. The performance
elements were simplified and assigned an identifi- 
cation number for future data control. They were 
also worded such that each element was general- 
izable across maneuvers with specific definitions 
reserved for the applicable maneuver. Table 1 also 
shows the applicability of a given element to the 
five maneuvers still under consideration at this 
point in the study. The “NA” notation indicates 
that the given element is not applicable or is of 
minimal importance to the successful completion 
of an ideal performance or has minimal effect on
the overall grade assigned to the maneuver (e.g., 
element 2B, the control of airspeed during a 
Pitchout). No notation indicates that the element 
is applicable. The reason for showing the Lazy 
Eight as two maneuvers is explained in the follow- 
ing section. 

In summary, Appendix I and Table 1 are the 
results of establishing identifiable variables 
associated with the performance of a given 
maneuver and with the grading of the maneuver. 
Some identifiable variables are, of course, not 
included since they could not be judged through 
utilization of an Audio-Video Recording System 
(AVRS). 



Table 1. Applicability(a) of Performance Elements to Maneuvers



Maneuver columns: Pitchout; Final Turn to Landing; Normal Landing Pattern; Lazy Eight, First 180°; Lazy Eight, Second 180°; Vertical S "A"

Performance Element



1. Specific check point 
“book” criteria or 
setup requirements 

A. Airspeed 

B. Altitude 

C. Attitude 

D. Heading 

E. Positioning of Aircraft 

F. Aircraft Configuration 

G. Trends of: 

1 . Airspeed 

2. Altitude 

3. Attitude 

2. Control of: 



A. Power 

B. Airspeed 


NA 




C. Altitude 

D. Heading 

E. Pitch Angle 

F. Rate of Roll 




NA 


G. Angle of Bank
H. Rate of Turn
I. Rate of Pitch Change


NA 


NA NA 


J. Rate of Ascent/Descent 


NA 




3. Cross Check 

4. Error Correction 

5. Transitioning 


NA 


Roundout Roundout 


6. Use of Trim

7. Safety (clearing turns 
or spacing) 






8. Cockpit Procedure (Audio 
only) 


NA 




9. Use of Ground Reference 
Points or Lines 

10. Aircraft Configuration 

11. Aircraft Operation Within
Published Limitations 

12. Touchdown 


NA 




13. Radio Procedure (Audio 
only) 


NA 





(a) NA = Not applicable, or of minimal importance.



NA 

NA 



NA 

NA 

NA 

NA 

NA 

NA 

NA 

NA 

NA 



NA As Directed 

As Directed 
NA 
NA 

NA 

NA 

NA 

NA 

NA As Directed 



NA 


NA 

NA 

NA 

NA 


NA 

NA 




NA 


NA 


NA 


NA 


NA 


NA 

NA 


NA 

NA 


NA 

NA 


NA 


NA 



Preliminary Performance Scale 
Development 

As in most scale developments, plans were 
made to develop and test a preliminary set of 
scales in order to determine whether the approach 
being taken was reasonable and merited continued 
efforts or required a change. An additional incen- 
tive for developing a preliminary set of scales was 
that it could be used to learn more about the 
maneuvers and about the process through which 



an instructor relates performance to a scale or a 
chosen point on the scale. One way to identify 
those obscure or poorly understood aspects of 
factors an instructor uses to make judgments on 
levels of performance is to ask instructors to 
verbalize what they are seeing or looking for 
during a performance. An instructor confronted 
with an incomplete preliminary scale for use as a 
reference upon which to mark a level of perform- 
ance would be provided with the requisite stimulus 
to verbalize. 
Fig. 1. Sample page of preliminary pilot performance reference scale.



Figure 1 is a representative example of the 
preliminary scales that were developed. A separate 
set of scales was developed for each of the five 
maneuvers, with a double set for the Lazy Eight. 
The latter maneuver was divided into two parts, 
the first 180° and the second 180°. This was due 
to observations from the video tapes and com- 
ments by the instructors that a student generally 
performs that part of an L8 which goes to the 
right better than the one which goes to the left 
because of the side-by-side cockpit arrangement in 
the T-37 aircraft. Therefore, separate evaluations 
seemed to be appropriate. 

To continue with the explanation of Figure 1,
the first element for each maneuver was the “Over- 
all” performance. The scale developed for grading 
the overall performance was a 12-point, equal- 
interval, unidimensional scale anchored near the 
two extremes with a U and an E. The primary 
objective behind this scale was to provide the 
grader with a greater number of discrimination 
possibilities than was thought to be required to 
effectively grade the performance. It was desired 
to relate the scale to the current 4-point grading 
system, yet not to restrict the grader to a rigid 
relationship. Therefore, only the U and E were 
placed on the scale. As can be easily ascertained, 
the 12 points can represent U-, U, U+, F-, F, F+,
G-, G, G+, E-, E, and E+. Tests of this preliminary
scale were to determine how many discriminations
were made by the instructors in grading a
series of maneuvers. No verbalization was 
attempted to describe the points along this overall 
grading scale. Such a task would be forbidding 
when one considers the number of ways a 
maneuver could be performed in a “good” 
manner, especially for complicated maneuvers such
as the Normal Landing Pattern. With a verbalized 
scale, the “good” point, for example, would have 
to contain a description of each possible way of 
doing the maneuver which could result in a 
“good” grade. Such descriptions come within the 
realm of possibility when maneuvers are broken 
down into the segments or elements which make 
up a maneuver. Each element can be more readily 
verbalized along its dimension. 
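
The relationship between the 12-point overall scale and the current 4-point system can be written out explicitly. The sketch below is illustrative only: the function names are hypothetical, and the placement of the printed U and E anchors at points 2 and 11 is an inference from the statement that the scale is anchored near its two extremes.

```python
# Current 4-point grading system, lowest to highest.
LETTER_GRADES = ["U", "F", "G", "E"]
MODIFIERS = ["-", "", "+"]

# Twelve labels in scale order: U-, U, U+, F-, ..., E, E+.
SCALE_12 = [letter + mod for letter in LETTER_GRADES for mod in MODIFIERS]


def label_for(point: int) -> str:
    """Return the label for a scale point numbered 1 (lowest) to 12 (highest)."""
    if not 1 <= point <= len(SCALE_12):
        raise ValueError(f"point must be between 1 and {len(SCALE_12)}")
    return SCALE_12[point - 1]


def letter_for(point: int) -> str:
    """Relate a 12-point mark back to the current 4-point letter grade."""
    return label_for(point)[0]


# Assumed anchor positions: the only letters printed on the preliminary scale.
ANCHORS = {SCALE_12.index("U") + 1: "U", SCALE_12.index("E") + 1: "E"}

if __name__ == "__main__":
    for point in range(1, 13):
        marker = " (anchor)" if point in ANCHORS else ""
        print(point, label_for(point), letter_for(point) + marker)
```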

Again referring to Figure 1, it can be seen that 
each element contains at least two and sometimes 
four verbal descriptions across its dimension. The 
scale used for each of the elements was a 4-point 
scale with no pretense that it was anything differ- 
ent from the current 4-point system; the only 
difference was the addition of the descriptions. 
Parts of the scale were left blank purposely, in 



order to take advantage of verbalizations solicited 
during the tests of these preliminary scales. 

The final item to be explained on Figure 1 is 
the “Ideal” column. The purpose of this entry was 
two-fold. First, it provided a definition of the 
performance element associated with it and, 
second, it served as a “reminder” to the instructors 
as to what the ideal or perfect value or performance
requirement was as cited in the ATC Manual
514 and other sources. 

Before reporting the results of the test and 
evaluation of the preliminary scales, it is important 
to comment on an overall consideration which 
influenced all scale development. It was necessary 
to use personnel currently assigned as instructor 
pilots in testing the reliability of any developed 
scales. Except, perhaps, during unpredictable 
periods of foul weather, these instructors are 
extremely occupied in the performance of their 
primary duty, that of training student pilots. It is
axiomatic that any scale developed which was 
unfamiliar or completely foreign to an instructor 
could not achieve acceptable reliability without 
extensive training as to its use. Such training 
would not have been operationally feasible. The 
decision to develop a scale which was relatable to 
the 4-point system and as easy as possible to 
comprehend and use was based on this considera- 
tion. This is not meant to be construed, neces- 
sarily, as a study limitation. 

The preliminary scales, then, consisted of two 
parts: a 12-point, apparently equal-interval linear 
scale for use in grading the overall performance of 
a study maneuver, and a 4-point verbal scale for 
each of the performance elements defined for the 
maneuver or segment. 

Test and Evaluation of the 
Preliminary Scales 

Tests of the preliminary scales were carried out 
at Vance Air Force Base over a period of three 
days with nine volunteer instructors as partici- 
pants. The objectives of the tests were as follows: 

1. To gain further insights into the perform- 
ance elements and to obtain additional inputs for 
possible refinement of the elements. 

2. To obtain better verbalizations for each dis- 
criminable point on the developing scales. 

3. To determine the frequency of use of each
of the twelve points provided on the preliminary 
scales. Note that this objective does not include 
the determination of whether an instructor could 




reliably discriminate between a G- and an F+, for
example, but only the use and frequency of such 
discriminations. 

4. To determine, from the examples of maneuvers
available to the study, which of the performance
elements could be graded on the basis of the
video information presented. As in the previous 
objective, the accuracy of such grades was not an 
issue. 

5. To obtain expert opinions from instructor 
pilots as to which of the performance elements 
were most critical to the successful (i.e., ideal) 
completion of a given maneuver. 

6. To obtain data, to the maximum extent 
possible, upon which to base judgments as to how 
many instructors would be required to judge how 
many examples of each maneuver for the formal 
test and evaluation of the final scales still to be
developed, in order to (a) provide the data
required for statistical treatments, the results of
which would be interpretable and applicable to
determining the reliability of the performance
scales, and (b) provide the basis for requesting
only as many instructors as necessary for testing 
scale reliability (i.e., the basis for making a realistic 
request to the operational command for instructor 
pilot participation). 

It was not an objective of this test to determine 
whether or not such scales could be effectively 
utilized in an operational context. The planning, 
conduct, and results of the test are briefly 
described in the following paragraphs. 

In addition to preparing and developing the 
preliminary scales, the planning phase consisted of 
preparing the test video tape, preparing the criti- 
cality form, and preparing the guidelines for on- 
site conduct of the test. In order to obtain the 
maximum amount of data in the minimum of time 
and to effectively use the instructor participants, a 
special video tape was prepared. Six test examples 
each of the Normal Landing Pattern, Pitchout, 
Vertical S “A,” Final Turn to Landing, and Lazy 
Eight were selected from the inventory of exam- 
ples. In addition, one other example of the Normal 
Landing Pattern, Vertical S “A,” and Lazy Eight 
were selected. The latter examples were used to 
orient instructors not familiar with using the 
Audio-Video Recording System to make judg- 
ments relative to pilot performance through the 
medium of a television screen. The test examples 
for the five maneuvers were selected by considering
each of the following factors: (a) clarity of



recordings; (b) appropriateness of the mix of
inside and outside views; (c) degree to which it was 
judged that meaningful indications of performance 
could be identified from the recordings; (d) extent 
of variability of each of the performance elements; 
and (e) prejudgments of overall maneuver performance
levels which would, hopefully, show different
degrees of performance across the scale from 
“bad” to “good.” The examples on the video tape 
were organized for fast identification and selected 
retrieval. 

The audio portion of the audio-video tape was 
not transcribed onto the test tape. The reason for 
this was the number of instructional comments 
that were made on the tapes when originally 
recorded “live” which would bias the evaluation of 
the performance by an instructor other than the 
one who actually flew the mission. Independent 
judgments were mandatory in the test situation. 
This does not connote that the audio is of no value 
in making judgments on performance levels. In the 
“live” or operational situation, it would be very 
valuable in assisting the instructor who flew the 
mission to recall events for more accurate record- 
ing of performance levels on official grade sheets, 
and in providing a means for recording critical 
information on the audio portion of the tape when 
the instructor desires to remain with the outside 
view camera (for example, the airspeed at the 90° 
positions in the L8, or the altitude and airspeed 
halfway around in the FTL). 

A “criticality of performance” form was also 
prepared in order to format instructor response to 
the question of criticality. This form is illustrated 
by the example shown in Figure 2. Each of 13 
performance elements and their subdivisions (only 
the first three with their subdivisions are shown in 
Figure 2) were listed, followed by the “ideal” per- 
formance already explained. These elements with 
appropriate “ideals” were prepared for each of the 
five maneuvers. The instructors were asked to fill 
in the third column with a number from 1 to 5 
indicating the degree of criticality of that element 
to successful maneuver performance. Using this 
scale, a 1 indicated that the element was of minor 
importance, a 3 indicated that it was of moderate 
importance, and a 5 indicated that it was of ex- 
treme importance. The instructor was also given 
the opportunity, in the “Remarks” column, to ex- 
press his opinion as to the relevancy, definition, or 
whatever of any of the performance elements, or 
to provide missing performance elements. 
MANEUVER: VERTICAL S "A"

Performance Element               "Ideal"                                  Criticality of     Remarks
                                                                           Performance (1-5)

1. Specific checkpoint "book"
   criterion or setup requirements
   A. Airspeed                    160 kts
   B. Altitude                    steady as directed
   C. Attitude                    level
   D. Heading                     steady as directed
   E. Positioning of a/c          NA                                       NA                 NA
   F. A/C configuration           clean
   G. Trends of
      1) Airspeed                 constant
      2) Altitude                 holding
      3) Attitude                 steady

2. Control of:
   A. Power                       sufficient to control rates of
                                  ascent/descent & a/s through transition
   B. Airspeed                    constant
   C. Altitude                    as directed
   D. Heading                     constant
   E. Pitch angle                 sufficient to control a/s & reverse
                                  direction of vertical movement of a/c
                                  through transition
   F. Rate of roll                NA                                       NA                 NA
   G. Angle of bank               NA                                       NA                 NA
   H. Rate of turn                NA                                       NA                 NA
   I. Rate of pitch change        NA                                       NA                 NA
   J. Rate of ascent/descent      1000' per min.

3. Crosscheck                     continuous crosscheck of instruments

NA = Not applicable, or of minimal importance

Fig. 2. Sample page of form for recording criticality of performance.






Table 2. Distribution of Frequency of Scale Point Usage

Frequency of Grades Assigned on 12-point Scale

Maneuver                            U-  U  U+  F-  F  F+  G-  G  G+  E-  E  E+  Total

Lazy Eight (1st 180°)
Lazy Eight (2d 180° and overall)
Final Turn to Landing
Vertical S “A”
Normal Landing Pattern
Pitchout

Total

Row A
Row B


The need for an informal atmosphere during 
the tests in order to obtain maximum flexibility 
and utilization of available instructor times neces- 
sitated the preparation of minimal guidelines. Such 
guidelines actually took the form of procedures 
for test conduct. 

In order to achieve maximum independence of 
judgments, each instructor was taken individually. 
The two-hour time limitation did not allow an 
instructor to grade examples of all five maneuvers. 
Therefore, the instructors graded examples of the 
Normal Landing Pattern and examples of two of 
the remaining four maneuvers. Each session con- 
sisted of four parts: introduction to the test, 
viewing of each example followed by grading using 
the preliminary performance scales, a discussion of 
the example just viewed, and the completion of 
the criticality of performance forms. During the 
introduction to the test, each instructor was 
briefed as to the objectives of the study along with 
the need for his inputs and expectations there- 
from. He was also given an opportunity to view 
and discuss the three preliminary examples from 
the video tape. An example of a maneuver was 
then shown. The instructor was asked to assign a 
grade to the overall maneuver and then to each of 
the performance elements by checking the appro- 
priate box. The instructor was allowed to view the 
example, or parts thereof, as many times as desired 
since the objectives of the test did not include 
testing the capability of an instructor to observe 
all the cues or retain what he had seen after a 
single viewing. In practice, this opportunity was 
seldom used. An audio tape recorder was used to 



record the discussion of each maneuver after it was 
shown and the grading completed. 
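
The session constraint described above (every instructor grades the Normal Landing Pattern plus two of the remaining four maneuvers) leaves six possible maneuver pairs. The report does not state how pairs were assigned to the nine instructors, so the rotation below is purely an assumption for illustration, and the function name is hypothetical.

```python
from itertools import combinations, cycle

REQUIRED = "Normal Landing Pattern"
OTHERS = ["Pitchout", "Vertical S A", "Final Turn to Landing", "Lazy Eight"]


def session_plans(n_instructors: int):
    """Assign each instructor the required maneuver plus one pair of the others.

    Pairs are handed out in rotation; this assignment rule is an assumption,
    not the procedure documented in the study.
    """
    pairs = cycle(combinations(OTHERS, 2))
    return [(REQUIRED, *next(pairs)) for _ in range(n_instructors)]


if __name__ == "__main__":
    for number, plan in enumerate(session_plans(9), start=1):
        print(f"Instructor {number}: " + ", ".join(plan))
```

With nine instructors and six distinct pairs, three pairs necessarily repeat, so coverage of the four remaining maneuvers cannot be perfectly balanced under any such rotation.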

Further insights into the performance elements 
and recommendations for better verbalizations (as 
well as for filling the gaps) at major points on the 
scale were, in fact, realized from the test results. 
The audio tape (containing maneuver discussions) 
and the modifications to the preliminary scale 
made by instructors during the actual grading of 
performances provided the information required. 
Instructors had been asked to make any changes 
they desired to the performance elements and 
scale verbalizations that were contained on the 
preliminary scales. As a result, instructors changed 
some values (such as ±5 knots) given under specific 
performance levels and filled in some blanks in the
scale with values or words as to why that partic- 
ular level of performance was assigned. This 
information, along with some minor inputs from 
the completed criticality of performance forms
and the marking of the “Not Observed” column
on the preliminary scale forms, showed which of 
the performance elements could and could not be 
reliably graded on the basis of the video informa- 
tion shown. 

Table 2 shows the frequency of use of each of 
the 12 scale points. For easy reference and trans- 
ferability to current 4-point scales, the columns 
have been headed U-, U, U+, F-, F, F+, G-, G,
G+, E-, E, E+. There were 130 observations, with
the greater number of grades being assigned to the 
upper half of the scale. It should be noted that the 
shape of this distribution, and of distributions like 
it, is readily affected by the extent to which the 
different performance levels are represented in the
examples to be graded. A set of examples selected 
primarily from early stages of flight training would 
probably produce a grade distribution skewed in 
the opposite direction. This table shows that 10 of 
the 12 points were used in grading the video exam- 
ples used. The figures in the row labelled “A” 
show the probable distribution had there been a 
4-point scale, and those in row “B” show the 
probable distribution had there been a 7-point 
scale. This latter grouping appears most interesting 
in that the scale point F+/G- received the greatest
usage by the instructors for the maneuver exam- 
ples shown. It is not known whether instructors 
would have used that point with the same fre- 
quency had they been given a 7-point scale. 
Except on a relative basis the precise position of 
the point F+/G- on the scale is not known either. 
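
The regrouping behind rows “A” and “B” can be sketched as a tally followed by two collapses. This is a hedged illustration: the 4-point grouping follows directly from the letter grades, but the 7-point grouping shown is an assumption (the text names only the F+/G- point), and the grades in the example are invented rather than taken from Table 2.

```python
from collections import Counter

SCALE_12 = ["U-", "U", "U+", "F-", "F", "F+", "G-", "G", "G+", "E-", "E", "E+"]

# Row "A": each 12-point label collapses to its letter grade.
FOUR_POINT = {label: label[0] for label in SCALE_12}

# Row "B": ASSUMED 7-point grouping; only "F+/G-" is named in the text.
SEVEN_POINT = {
    "U-": "U-/U", "U": "U-/U",
    "U+": "U+/F-", "F-": "U+/F-",
    "F": "F",
    "F+": "F+/G-", "G-": "F+/G-",
    "G": "G",
    "G+": "G+/E-", "E-": "G+/E-",
    "E": "E/E+", "E+": "E/E+",
}


def tally(grades):
    """Count how often each of the 12 scale points was used."""
    counts = Counter(grades)
    return {label: counts.get(label, 0) for label in SCALE_12}


def collapse(counts, mapping):
    """Regroup 12-point counts into a coarser scale."""
    grouped = Counter()
    for label, n in counts.items():
        grouped[mapping[label]] += n
    return dict(grouped)


if __name__ == "__main__":
    # Invented grades for illustration only.
    sample = ["F+", "G-", "G", "G", "E-", "U+", "F+"]
    counts = tally(sample)
    print("points used:", sum(1 for n in counts.values() if n))
    print("row A:", collapse(counts, FOUR_POINT))
    print("row B:", collapse(counts, SEVEN_POINT))
```

Note that a collapse of this kind only redistributes marks already made on the 12-point form; as the text cautions, it does not show how instructors would have graded had they been handed the coarser scale.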

Table 3 is a summary of the average criticality 
values assigned by the instructors to each of the 
elements by maneuver. Only the three maneuvers 
on which the final scales were developed are 
shown. Each of the values in the cells opposite the 
elements was obtained by adding the criticality 
numbers assigned by all the instructors and 
dividing the result by the number of instructors. 
An average value for each maneuver was then 
computed and elements with a criticality value 
equal to or greater than this average were high- 
lighted. This table was used as an input to deci- 
sions made relative to the performance elements in 
developing final scales. As could be expected, the 
upper end of the 1-through-5 scale was that
primarily used for value assignments, with the 
numbers 1 or 2 being used rarely. 
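
The averaging and highlighting procedure just described can be expressed in a few lines. The sketch below assumes one list of instructor ratings per element for a single maneuver; the function names and the ratings are hypothetical and do not reproduce the study data.

```python
from statistics import mean


def element_averages(ratings):
    """Average the 1-to-5 criticality numbers assigned by all instructors."""
    return {element: mean(values) for element, values in ratings.items()}


def flag_critical(averages):
    """Return the maneuver average and which elements meet or exceed it."""
    maneuver_average = mean(averages.values())
    flags = {element: avg >= maneuver_average for element, avg in averages.items()}
    return maneuver_average, flags


# Hypothetical ratings from three instructors for one maneuver.
HYPOTHETICAL_RATINGS = {
    "1A. Airspeed set up on entry": [4, 5, 4],
    "2C. Control of altitude": [3, 4, 3],
    "3. Crosscheck": [5, 5, 4],
}

if __name__ == "__main__":
    averages = element_averages(HYPOTHETICAL_RATINGS)
    maneuver_average, flags = flag_critical(averages)
    print(f"maneuver average: {maneuver_average:.2f}")
    for element, avg in averages.items():
        star = "*" if flags[element] else ""
        print(f"{element}: {avg:.2f}{star}")
```

The asterisk printed here plays the same role as the asterisk in Table 3, marking elements at or above the maneuver average.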

Finally, the data from the test of the prelim- 
inary scales were used to provide the basis for 
judgments resolving two concerns relative to the 
test and evaluation of final scales. The first 
concern involved the number of instructors 
required to serve as judges in the test and the 
second involved the number of examples of each 
maneuver to use in the test. These two concerns 
are dependent upon the variability to be expected
in the data. The data from the test of preliminary 
scales provided information about variability 
which was used to estimate variability in the test 
of the final scales. 

Development of Final Pilot 
Performance Reference Scales 

The final development of pilot performance 
reference scales was based upon the three 
maneuvers selected, Final Turn to Landing (FTL),



Vertical S “A” (VSA), and Lazy Eight (L8), and 
upon the data collected to support their develop- 
ment. The preliminary pilot performance reference 
scales were revised by making additional refine- 
ments to the performance elements, by completing 
the verbalizations along the element scales, and by 
expanding the number of points in each element 
scale. The complete scales are contained in 
Appendix II. 

The refinements to the performance elements 
consisted first of improving the wording and 
definitions (Ideal). Another kind of refinement 
involved reducing the number of elements to a 
minimum. The elements retained were those most 
relevant in contributing to the assignment of an 
overall grade and which were clearly capable of 
being individually judged as to performance levels 
using the video system. The criticality scores and 
an analysis of the completed preliminary perform- 
ance scales provided the major inputs into deci- 
sions to retain, discard, or revise the wording of an 
element. A review of the audio tapes obtained 
during the discussion sessions provided additional 
insights into the elements. An example of the 
latter was the elimination of performance element 
6, use of trim. It was obvious from the tapes that 
some of the instructors were grading this element 
based upon inferences that were being made (and, 
most probably, reliable inferences) from the way 
the aircraft was being flown or from the results 
achieved. Although this capability was interesting 
to observe, grades which were not assigned as a 
result of judgments made from direct observations 
through the video system were not considered 
acceptable. Table 4 is a comparative summary of 
performance elements which were used in the
preliminary scales and those used in the final scales 
for the three selected maneuvers. For the Final 
Turn to Landing, elements 1D (heading set up), 1E
(positioning of aircraft at start of final turn), 1G
(trends of airspeed, altitude, and attitude), 2A 
(control of power), 2H (control of rate of turn), 6 
(use of trim), 7 (safety), and 12 (touchdown) were 
removed from the final scales. The reasons for 
their elimination were as follows: 

1D: A critical input into this judgment would
be a known wind condition. This could not 
be ascertained from the video portion of 
the tape. (Audio would correct this defi- 
ciency.) 

1E, 1G, 2A, 6, 7: Could not judge from the
video system. 

2H: Pilot performance is better judged by ele- 
ment 2G. Element 2H also duplicates 
element 2G. 
Table 3. Average Criticality Ratings of Performance Elements

                                                    Criticality Value for Maneuver

                                          Final Turn   Vertical   Lazy Eight   Lazy Eight   Overall
Performance Element                       to Landing   S "A"      1st 180°     2d 180°      Average

1A. Airspeed set up on entry              4.0          4.5 *      3.4          4.0 *        3.7
1B. Altitude set up on entry              4.1 *        4.25*      2.75         NA           2.75
1C. Attitude set up on entry              3.6          4.4 *      4.1 *        3.6          3.9 *
1D. Heading set up on entry               3.6          4.5 *      3.4          3.75         3.6
1E. Positioning of aircraft at entry      3.5          NA         3.9 *        3.75         3.8
1F. Aircraft configuration                4.9 *        3.2        3.75         3.75         3.75
1G. Trends on entry of airspeed,
    altitude, attitude, and heading       4.0          4.25*      3.1          NA           3.1
2A. Control of power                      4.75*        4.4 *      NA           NA           NA
2B. Control of airspeed                   4.75*        4.1        3.25         3.75         3.5
2C. Control of altitude                   3.5          3.4        NA           NA           NA
2D. Control of heading                    3.75         4.0        4.1 *        4.0 *        4.1 *
2E. Control of pitch angle                3.9          4.5 *      4.0 *        4.1 *        4.1 *
2F. Control of rate of roll               NA           NA         4.1 *        3.9          4.0 *
2G. Control of angle of bank              3.9          NA         3.9 *        3.9          3.9 *
2H. Control of rate of turn               3.3          NA         4.0 *        4.0 *        4.0 *
2I. Control of rate of pitch change       NA           NA         4.1 *        3.9          4.0 *
2J. Control of rate of ascent/descent     4.1 *        4.5 *      NA           NA           NA
3.  Crosscheck                            4.4 *        4.9 *      4.4 *        4.6 *        4.5 *
4.  Error correction                      4.25*        4.6 *      4.2 *        4.3 *        4.25*
5.  Transitioning                         3.75         4.0        NA           NA           NA
6.  Use of trim                           4.1 *        4.0        NA           NA           NA
7.  Safety (clearing or spacing)          4.75*        NA         NA           NA           NA
8.  Cockpit procedures                    4.25*        NA         NA           NA           NA
9.  Use of ground reference
    points or lines                       3.9          NA         4.1 *        4.1 *        4.1 *
10. Aircraft configuration                4.9 *        3.67       4.2 *        4.2 *        4.2 *
11. Aircraft operation within
    published limitations                 4.4 *        4.7 *      4.3 *        4.4 *        4.4 *
12. Touchdown                             4.1 *        NA         NA           NA           NA
13. Radio procedure (audio only)          NA           NA         NA           NA           NA

Overall Average                           4.1          4.2        3.8          4.0          3.9

Note. Asterisk indicates the element is considered to be average or above in criticality.
Table 4. Disposition of Performance Elements in Preliminary
and Final Reference Scales 






Performance Element

Element Used to Evaluate Maneuver: Final Turn to Landing; Lazy Eight; Vertical S “A”


1A. Airspeed set up on entry 


X 0 


X 0 


X 0 


1B. Altitude set up on entry


X 0 




X 0 


1C. Attitude set up on entry 


X 0 


X 0 


0 


1D. Heading set up on entry


0 


0 


X 0 


1E. Positioning of aircraft at entry


0 


X 0 




1F. Aircraft configuration

1G. Trends on entry of airspeed,


altitude, attitude, and heading 


0 


0 


X 0 


2A. Control of power 
2B. Control of airspeed 


0 

X 0 


X 0 


X 0 


2C. Control of altitude 


X 0 




0 


2D. Control of heading 


X 0 


X 0 


X 0 


2E. Control of pitch angle 


X 0 


X o 


X 0 


2F. Control of rate of roll 
2G. Control of angle of bank 


X 0 


X 0 
X 0 




2H. Control of rate of turn 


0 


0 




2I. Control of rate of pitch change
2J. Control of rate of ascent/descent 


X 0 


X o 


X 0 


3. Crosscheck 


4. Error correction 


X 0 


X o 


X 0 


5. Transitioning 


X 0 




X 0 


6. Use of trim 


0 




0 


7. Safety (clearing or spacing) 


0 






8. Cockpit procedures 

9. Use of ground reference 


points or lines 


X 0 


0 




10. Aircraft configuration 

11. Aircraft operation within 
published limitations 


12. Touchdown 


0 






13. Radio procedure (audio only) 


Note. x indicates element was used in final scale;
o indicates element was used in preliminary scale.

12: Critical inputs to a judgment of performance
level are missing on video, such as
hardness of landing, airspeed at touchdown,
precise moment aircraft touches
ground, and precise place of touchdown.

Elements 1D, 1G, 2H, and 9 (use of ground reference
points) were removed from the Lazy Eight
scale for the following reasons:

1D: Identical meaning to element 1E in the
Lazy Eight context.

1G: Of minimum importance with no set ideal
criteria.

2H: This element is a function of elements 2G
and 2I and would, therefore, be a repetitious
grade.

9: Identical meaning to element 2I in the
Lazy Eight context.



Elements 1C (attitude set up), 2C (control of alti- 
tude), and 6 were removed from the Vertical S 
“A” scale for the following reasons: 

1C: Identical meaning to element 1G in the 
Vertical S “A” context. 

2C: This element is part of element 5. The alti- 
tude, per se, is not important except as a 
basis to judge the capability of a student to 
transition. 

6: Could not judge from the video system. 
Fig. 3. Sample page of a preliminary pilot performance reference scale.

Fig. 4. Sample page of a pilot performance reference scale.

A comparison of Figures 3 and 4 illustrates the
types of changes in wording or clarity that were
made to the performance elements (e.g., see
element 2B). A similar inspection exemplifies the
additions and modifications made to the verbaliza- 
tions at the four points along the scale. Also 
illustrated is the revision of format that was ac- 
complished for the final scales. The format 
modification consisted primarily of three parts. 
First, the 12-point scale used in the preliminary 
performance scale was modified to a 10-point 
scale. This change was based primarily on the 
results of the preliminary test shown in Table 2. 
Except for one occasion, the extreme upper and 
lower points (i.e., the U- and E+) were not used.
Hence, it seemed appropriate to discontinue 
depicting those points. It was felt that further 
reductions to the number of scale points, at this 
stage of scale development, were not warranted
from the results of the preliminary tests. The 
second format change was concerned with the 
application of the 10-point scale to each of the 
performance elements in addition to its 
application for an overall grade (the 12-point scale 
applied only to the overall grade and a 4-point 
scale to the performance elements in the prelim- 
inary scales). The third change applied to the 
elimination of the U and E from the scale. 
Although, conceptually, these values and the F 
and G values can be superimposed on the scale, it 
was desired to move as far away from such a depic- 
tion as possible while still retaining the trained 
instructors’ familiarity with the UFGE system. 
The four verbal descriptions are meant to fit the 
scale, from left to right on Figure 4, as follows: 



U     U+     F-     F+     G-     G+     E-



In summary, the final pilot performance refer- 
ence scales consist of performance elements 
common across maneuvers (but not necessarily 
applicable to every maneuver) which describe the 
maneuver in terms of the skills being learned or a 
required condition of flight. The level of perform- 
ance for each element is graded on a 10-point 
scale. The upper (or Ideal) and lower (or Unsatis- 
factory) parts of the scale and two additional 
points are verbally described as to the level of 
performance required at those approximate points 
on the scale. These verbalizations are specific to 
the maneuver and the performance element within 
the maneuver. 



III. TEST AND EVALUATION OF SCALES 

This section reports on the conduct and results 
of the test and evaluation of the final pilot 
performance reference scales. The objectives of
this concluding phase of the study were to determine
the number of discriminations that can be
made for each of the three selected maneuvers and 
to determine the reliability of these scales. As has 
previously been noted, the notion of determining 
the validity of the scales was abandoned when it 
was concluded that there was no basis, at this 
point in scale development and with the current 
UFGE grading system, upon which to assess what
the scale was truly measuring. There were two 
primary factors which led to this conclusion. First, 
there is no official documentation which defines 
the maneuvers in terms of differing degrees of 
“goodness” of performance. Secondly, the grades 
which were assigned to the maneuvers by the 
instructor who actually flew the mission “live” 
(and which, in the majority of cases, are included 
in the study data base) could not be used as a basis 
for establishing validity; the latter instructor had
more inputs into the decision as to what to assign 
as a grade than are provided others by the current 
AVRS. For example, the element “use of trim,” if 
performed poorly in the final approach, might 
change an otherwise G performance into an F 
performance in the “live” situation. The same 
maneuver would, all else being equal, be graded a 
G from viewing the video example since the ele- 
ment “use of trim” is not observable from the 
current AVRS recordings. Thus both the F grade
and G grade would represent a valid grade with no 
way of relating the two. Reliability, on the other 
hand, can be assessed if based on the assumption 
that scale reliability relates directly to the capa- 
bility of instructors to make standard and consis- 
tent judgments as to the level of performance of a 
maneuver. Neither the instructor pilots nor the
study team members were oblivious to the 
possibility that this assumption might not be valid 
in all instances. However, utilization of the scales 
to grade a given set of maneuvers by expert and 
experienced instructor pilots provided the best 
method available for assessing scale reliability. 

The two parts of this section report the plan- 
ning for and conduct of the test and the analysis 
of the data obtained. 

Test Planning and Conduct 
Test Planning 

The primary planning efforts consisted of deter- 
mining the specific number of instructor pilots and 
maneuver examples required for making the 
requisite number of judgments for testing scale 
reliability, preparing the 1-inch test video tape,
and selecting the test site and the instructor pilots
who were to participate in the tests. 
An estimate of the minimum number of
instructor pilots desired for the test effort was
made both in terms of Type I and Type II errors.
With respect to a Type I error, the question was
one of estimating the smallest correlation which
one would want to conclude as significant. In the
case of the intraclass correlation, for a correlation
of .50 to be significant, a minimum sample size of
12 is needed when α is .05. With respect to Type II
errors, the question was one of estimating the
likelihood of incorrectly accepting the hypothesis that
no correlation exists in the population samples.
According to Winer’s (1962) discussion of the
intraclass correlation, this correlation is related to
an F ratio, and the power of this ratio can be
determined provided the variance can be estimated
and the minimum difference between means to be
detected can be stated. The variance was estimated
to be 2.50 on the basis of pre-study data. The
minimal difference was set at one scale unit
between a pair of mean scale values for a pair of
maneuver examples. Using the technique discussed
by Winer (1962, p. 104), it was found that the power
would be .90 for an α of .05, and .70 for an α of
.01, with five maneuver examples and 20
observers. If the difference between maneuver example
means is reduced to one-half of a scale unit, the power
would be approximately .70 with α at .05. With a
smaller variance or more maneuver examples, the
power of the tests increases. These considerations
appeared to justify the use of a sample of 20
observers.

The principal factor limiting the maximum 
number of examples to be used per maneuver was 
the length of time available per observer. However, 
an estimate of the increase in the correlation 
coefficient anticipated from increasing the number 
of examples (homogeneous with the original exam- 
ples) was made using the preliminary data. In the 
preliminary data, a correlation of .50 was repre- 
sentative of the intraclass correlation coefficients 
obtained on the basis of four examples per maneu- 
ver. Increasing the number of examples to eight 
results in an anticipated new correlation of .67 
(Guilford, 1954, p. 391). It was, therefore, 
decided to use a minimum of eight examples per 
maneuver for the test effort. 
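
The anticipated rise from .50 to .67 is consistent with the Spearman-Brown prophecy formula. The sketch below assumes that this is the relation taken from Guilford (1954, p. 391); the function name and its parameters are illustrative and do not come from the report.

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability when the number of examples is multiplied by k.

    r is the reliability observed with the original number of examples.
    Assumes the added examples are homogeneous with the original ones.
    """
    return k * r / (1 + (k - 1) * r)


# Preliminary data: r = .50 with four examples per maneuver.
# Doubling to eight examples gives a lengthening factor k = 8 / 4 = 2.
predicted = spearman_brown(0.50, 8 / 4)
print(round(predicted, 2))  # 0.67
```

With k = 1 the function returns the original coefficient, which is a quick check that the formula is applied in the right direction.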

As with the test conducted using the prelim- 
inary pilot performance reference scales, a special 
video tape was prepared. This tape contained 
examples of maneuvers which were to be judged 
on the degree of “goodness” of pilot performance 
by experienced instructor pilots using the final 
scales. The selection of a maneuver example for 
use during the test was based upon several guidelines.
The primary guidelines for selection were
those of clarity and completeness of the example
and the estimated level of performance depicted. 
Other factors were the number of examples that 
could reasonably be judged during a two-hour 
period and the need for repeating examples (from 
those in the first test set of examples as well as 
those within the second set) as an input into state- 
ments regarding final scale reliability. Figure 5 
summarizes the contents of the video tape developed
for the final test and evaluation. As with
the first test, non-test examples of the three 
maneuvers to be judged were included on the test 
tape so as to provide orientation and practice for 
the instructor pilots who were not familiar with 
the video output of the AVRS. Two of the eight 
examples for the Final Turn, two of the eight 
examples for the Vertical S “A,” and one of the 
nine examples for the Lazy Eight were repeated, 
making a total of ten examples per maneuver. 
Final Turn examples 1 and 3 were repeated as 
examples 5 and 9; Vertical S “A” examples 2 and 
3 were repeated as examples 8 and 9; and Lazy
Eight example 3 was repeated as example 9. As 
can be ascertained from Figure 5, each example 
was identified as to its location on the tape and 
the length of time required to view each example 
and each group of examples with appropriate 
remarks included for use by the test director. 
Every attempt was made to select maneuver 
examples which were representative of major 
discriminable points across the scale, as well as 
examples which would illustrate varying degrees of 
overall performance of the performance elements. 
This effort was not entirely successful because 
such a variety of examples was not available from 
the data bank. 

The final major planning effort for the test was 
to establish the time and place for the test sessions 
and to obtain the necessary number of judges (i.e., 
experienced instructor pilots). As previously 
stated, the video examples of the maneuvers were 
taped from missions flown from Vance Air Force 
Base. To obtain accurate judgments on level of 
performance, it was considered important that
the judges be familiar with the terrain over which 
the maneuvers were flown. For example, in the 
Final Turn to Landing maneuver, familiarity with 
the ground track, the immediate surrounding area, 
and the active runway would allow the instructor 
pilot to make more accurate performance judg- 
ments. To a lesser extent, familiarity with the 
terrain was also important in judging the Lazy 
Eight where the section lines, visible on the video 
recordings, provide familiar cues.

Maneuver   Counter    Min. + Sec.   (Min.)   Remarks

NLP          0- 20      2 + 10               Hacker - Right-Hand Pattern
L8          21- 43      2 + 25       ( 8)    2 examples
VSA         45- 80      3 + 25               1 1/2 examples

FT    1     83- 94      1 + 05               Vance - Right-Hand Pattern
      2     95-103      0 + 45               Hacker - Left-Hand Pattern
      3    111-120      0 + 55               Hacker - Left-Hand Pattern
      4    122-132      0 + 50               Vance - Left-Hand Pattern
      5    135-146      1 + 05       (10)    Vance - Right-Hand Pattern
      6    147-159      1 + 05               Hacker - Left-Hand Pattern
      7    160-172      1 + 05               Vance - Left-Hand Pattern
      8    173-185      1 + 05               Vance - Left-Hand Pattern
      9    200-210      0 + 55               Hacker - Left-Hand Pattern
     10    212-224      1 + 00               Hacker - Left-Hand Pattern

VSA   1    225-253      2 + 20               heading 350
      2    255-281      2 + 15               heading 080
      3    289-317      2 + 10               heading 350
      4    318-347      2 + 15               heading 190
      5    348-383      2 + 35       (24)    heading 230
      6    384-430      3 + 15               heading 350
      7    433-468      2 + 30               heading 110
      8    470-506      2 + 25               heading 080
      9    507-540      2 + 10               heading 350
     10    542-578      2 + 10               heading 170

L8    1    579-610      1 + 50
      2    612-639      1 + 30
      3    641-661      1 + 05               Stop on instr. at end
      4    662-694      1 + 40
      5    696-728      1 + 35       (15)
      6    731-764      1 + 30
      7    765-798      1 + 20
      8    800-836      1 + 30
      9    837-868      1 + 05               Stop on instr. at end
     10    870-926      1 + 50

Fig. 5. Final test and evaluation inventory of maneuver examples.

Such cues would
not be as useful to instructor pilots from Williams
Air Force Base, for example, where the terrain
offers no such section lines and other cues must be
used to assist in judging performance levels. Therefore,
Vance Air Force Base was selected as the
location for the tests. A request for 24 instructor 
pilots was made in order to allow for unpredic- 
table cancellations. As it turned out, 23 of the 24 
instructor pilots scheduled were able to participate 
in the test. Because of the operational commit- 
ments of the instructors, the final schedule con- 
sisted of eight two-hour sessions over a three-day 
period, with three instructors participating in each 
session. It was also requested that as many instruc- 
tors as possible who participated in the first test 
on the preliminary scales be included among those 
participants in the test being described here (six of
the nine so responded). It was recognized that one
instructor per session would have probably 
resulted in a more valid test effort, but this was 
not operationally feasible. However, it is the 
opinion of the on-site director that independent 
judgments concerning the level of performance of 
the maneuver examples shown were, in fact, 
obtained. This opinion is based primarily upon the 
outstanding cooperation and interest demon- 
strated by the participants during the tests. Except 
for after-the-fact comments or an occasional 
spontaneous exclamation over some exceptionally 
poor or outstanding performance being shown, no 
visual or verbal interaction occurred between the 
instructor pilots during the test sessions. 



Test Conduct 

The tests were conducted on October 7, 8, and
9, 1969, with three sessions on the 7th and 9th
and two sessions on the 8th. Each session consisted
of two primary phases. The first phase was
an approximate 20-minute period of orientation. 
This orientation consisted of a short briefing by 
the test director covering the purpose and 
objectives of the study, what was hoped to be 
accomplished from the inputs to be provided by 
the instructor pilots, a review of the scales to be 
used, an explanation of some rules governing the
conduct of the test, and a viewing of a sample of 
each of the three maneuvers. A number of 
important points were covered during the 
orientation: 

1. The instructor pilots were requested to 
judge performance levels using the scales provided
and to accept the verbal portion of the scales as 
written. 

2. The scales were experimental in nature and
were not being developed for operational use by
the Air Force during UPT, but were part of a
larger program.

3. The instructor pilots were not being tested 
on how they graded or the accuracy of their 
grades, nor were their assigned grades to be 
compared with other instructor pilot assigned 
grades for any purpose other than those related to 
making statements about the reliability of the 
scales being tested. 

4. Grade sheets (i.e., the pilot performance 
reference scales) were being identified by the name 
of the instructor pilot making the judgments. This 
was for the purpose of possible use by the project 
staff in the event analysis of the data suggested 
that additional useful information relevant to scale 
reliability could be obtained from instructor pilots 
who participated in the test. 

5. The instructor pilots were advised that they 
could see an example or portion thereof as many 
times as they felt necessary, or make use of the 
slow-motion and stop-action capability of the
video playback equipment. 

6. The test director explained the importance 
of receiving independent judgments of perform- 
ance levels and further requested that there be no 
verbal exchange of information about the
maneuver example until after all participants had 
finished viewing and grading the maneuver. 

7. Each of the pilot performance reference
scales for the three maneuvers was reviewed as to
content and use of the 10-point scale. In order to
achieve maximum understanding of the 10-point
scale in the minimum of time, a special depiction 
of the scales was shown and related to the familiar 
4-point grading system. This depiction was as 
follows: 




It was also explained that the verbalizations under 
the scales for each element applied to an area on
the scale and did not apply to any specific point 
on the scale. 

8. The orientation phase of each session was 
concluded with a viewing and discussion of the 
non-test examples of each of the three maneuvers. 

The test was conducted following very simple 
guidelines. The instructor pilots were given 10 
copies of each of the three pilot performance 
reference scales for use as grade sheets. The 30 
maneuver examples were then shown, in order, 
with each example being graded immediately after 
its showing. Before each showing of the Final Turn 
to Landing, information was provided regarding 
the field (i.e., Hacker or Vance), runway, and 
pattern (left- or right-hand). For the Vertical S 
“A” examples, the basic course to be maintained 
throughout the maneuver was given. This information
was provided because it is always known
to an instructor pilot
in the “live” situation. Primarily because of the
short duration of the maneuvers on the video, it 
was necessary to provide instructors with a situa- 
tion wherein they would only have to concentrate 
on evaluating the performance. 

Although it was not a formal part of the test, 
the participants were not discouraged from making 
whatever observations they desired concerning the 
use of the video playback to judge level of 
performances and concerning the associated pilot 
performance reference scales. These comments are 
summarized and discussed as follows: 

1. Element 2E of the Final Turn to Landing
scale. One instructor pilot felt that since the T-37
does not have an angle of attack indicator and the
scales do not provide for “control of power” as an
element, element 2E (and the description provided
along the scale), per se, is of minimal value and
could even be misleading. In other words, even
though the pitch angle does control airspeed (as
the present scale so states), it must be correlated
with the use of power (which controls the rate of
descent) in order to obtain a meaningful evaluation
of the level of performance. It is obvious that
this often misunderstood aspect of the final
approach must be clarified in any revised version
of the scales.

2. Element 2D of the Final Turn to Landing 
scale. One instructor pilot stated that cross-wind 
correction and an angled approach were two 
different conditions and recommended that they 
be shown as separate elements. Although the 
recommendation is a good one, it is the opinion of 
the study team that additional insights into the 
dependencies and independencies of these two 
elements must be developed, and that the verbal 
descriptions under the scale opposite the current 
element 2D should be clarified. 

3. Inside vs. outside views. Most instructor 
pilots commented that they would be better able 
to grade the Final Turn to Landing if the video 
views presented were of the instruments through- 
out all (or most of) the maneuver. The only 
outside view felt to be desirable would be a short 
period during the final approach. The participants 
were also of the opinion that more inside views 
and a more consistent expectant pattern between 
outside and inside views (or preferably, a split 
screen showing both) would result in more valid 
grade assignments. In general, the study team
supports this comment. However, additional
tradeoff analysis would have to be made before
coming to any conclusions since the AVRS is, first
of all, a training instrument. Also relevant in the
tradeoff analysis would be the development of an
effective use of the audio portion of the AVRS. As
has previously been stated, no audio was provided
during the test.

4. Element 1B of the Vertical S “A.” One
instructor pilot recommended that the ±10 feet 
verbalized at the ideal, or E, end of the scale be 
changed to ±20 feet and that the other values be 
adjusted accordingly. (The other two participants 
in the session concurred.) The primary reason for 
this change is that ±10 feet would be extremely 
difficult to read from the current T-37 altimeter 
dial, especially from a video playback using the 
AVRS. Although this reason is not a valid one 
when determining level of performance require- 
ments for different degrees of “goodness,” it 
becomes a serious consideration in discussions rela- 
tive to the AVRS and its use as a tool for eval- 
uating performance levels. 

Table 5 presents an overall view of the number 
of judgments made during the conduct of the test 
at Vance Air Force Base. In summary, a total of 
7,148 elements were judged on the basis of video 
playbacks of 608 maneuver examples of the Final 
Turn to Landing, Vertical S “A,” and Lazy Eight. 



Table 5. Summary of Number of Observations by 23 Instructor Pilots

                       N          Total      N          N Judgments
           N           Examples   Examples   Maneuver   per          Total
Maneuver   Judges      Judged     Judged     Examples   Example a    Judgments

FTL          23          10         230        230         13          2,990

VSA          15          10         150
              3           6 b        18
              5           4 c        20        188         11          2,068

L8           19          10         190
              4           0 d         0        190         11          2,090

Total                                          608                     7,148

a FTL had 12 elements plus overall grade; VSA and L8 had 10 elements plus overall grade.
b Three IPs judged only examples 1 through 6.
c Five IPs judged only examples 7 through 10.
d Four IPs did not judge L8.
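
The totals in Table 5 can be rebuilt from the judge groups and the number of elements per scale. The sketch below is only an arithmetic check; the variable names and the data layout are illustrative assumptions, and the counts are those reported in the table and its footnotes.

```python
# (number of judges, examples each of them judged), as reported in Table 5
GROUPS = {
    "FTL": [(23, 10)],
    "VSA": [(15, 10), (3, 6), (5, 4)],
    "L8": [(19, 10), (4, 0)],
}
# Performance elements per final scale; the overall grade adds one judgment.
ELEMENTS = {"FTL": 12, "VSA": 10, "L8": 10}


def tally(groups, elements):
    """Return {maneuver: (maneuver examples judged, total judgments)}."""
    rows = {}
    for maneuver, judge_groups in groups.items():
        examples = sum(judges * n for judges, n in judge_groups)
        per_example = elements[maneuver] + 1  # elements plus overall grade
        rows[maneuver] = (examples, examples * per_example)
    return rows


rows = tally(GROUPS, ELEMENTS)
total_examples = sum(e for e, _ in rows.values())
total_judgments = sum(j for _, j in rows.values())
print(rows, total_examples, total_judgments)
```

The Vertical S “A” row only reaches 188 maneuver examples if five instructor pilots judged examples 7 through 10, which also agrees with the 20 judges cited for those examples elsewhere in the text.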



Analysis of Data 

This section reports the data which were col- 
lected during the test and the results of the 
analysis of this data. It is organized into two 
subsections concerned with the number of 
discriminations that were made for each of the 
scaled maneuvers and the reliability of the scales. 2 

The data relevant to determining scale dis- 
criminatory properties were analyzed in response 
to the following two objectives: 

1. To determine the number of times the 
judges used each of the 10 points on the scales. 

2. To determine the degree to which the 
judges agreed whether they could or could not 
grade a performance element based upon observa- 
tion of the video replay of a given maneuver 
example. 

Since expert and experienced instructor pilots 
were used to judge the level of performances, scale 
reliability was related directly to their capability 
to make similar judgments. Therefore, the data 
relevant to determining scale reliability were 
analyzed with respect to answering the broad 
question of how well the 23 instructor pilots 
agreed on the grades they assigned to the maneu- 
ver examples and their performance elements. 

Before proceeding with a detailed report of the 
data analyses, it is important to emphasize a basic 
assumption accepted throughout the analysis. As 
has been stated, the instructor pilots were 
requested to make their judgements on the level of 
performance depicted by the video replays using 
the scales provided and to accept the verbaliza- 
tions along each of the scales as written. There is 
no objective evidence available upon which a state- 
ment can be made that the participants did or did 
not use the scales as requested. Therefore, the 
assumption was made that they did and the 
analyses conducted accordingly. Although the 
scope of this effort did not so require, it might 
have proved interesting to analyze the data under 
the assumption that the instructor pilots did not 
really use the scales and compare the results of the 
two analyses. To clarify what the difference is, the 
former assumption (the one used for this report) 
permits the testing of scale reliability whereas the 
latter assumption tests for the reliability (or stand- 
ardization) of the grading of the instructor pilots. 

2 A detailed description of various aspects of the 
reliability studies has been prepared as a separate 
unpublished appendix which is available to qualified users 
upon request to AFHRL (FT), Williams Air Force Base, 
Arizona 85224. 



It is believed that (apparently) different results 
would be obtained; a detailed comparative analysis 
would provide additional insights and inputs into 
further scale development efforts. 

Number of Discriminations 

Given the 30 examples of the three maneuvers 
selected for scaling and test and evaluation, Figure 
6 shows the number of times each of the ten 
points was used to indicate a level of performance;
also shown are possible distributions of usage
of points when the scaling device is collapsed into 
a 7-point and a 4-point scale. The scale point 
receiving the greatest usage is at point G for the 
FTL, L8, and Total, and at point G- for the VSA.
Except for the L8, it also shows two breaks from 
an upward and downward progression at points F+ 
and G+. The data in Figure 6 show only that 
instructor pilots, given a 10-point scale, used all 
points on the scale for grading; whether or not 
they are able, in fact, to make valid 10-point 
discriminations, or whether or not such discrimina- 
tions are important to make during UPT, was not 
ascertained from the available data. The points F+, 
G-, G+, and E- on the 7-point scale are of special
interest; their relatively high usage suggests, even 
though projected from the 10-point scale, that 
there might be some validity to those points and 
that their possible value to the UPT program 
should be investigated. 

Tables 6, 7, and 8 summarize the data obtained 
relative to the determination of the degree to
which the judges agreed which performance 
elements could or could not be graded. Table 6 
shows the number of times, by performance 
element and maneuver, the judges marked the 
“not observed” box on the scales. The term “not 
observed” had two meanings in the context of 
these scales. The first connotation was, of course, 
that the applicable element could not be seen on 
the video replay (e.g., the airspeed, element 1A, at
the start of the Final Turn). The second meaning 
was that a judge considered that the information 
presented on the video replay of a given perform- 
ance element was insufficient for the purpose of 
grading that element (e.g., if an inside view of the 
airspeed indicator was shown only once on a Final 
Turn to Landing at, say, the 90° position, the 
judge might decline to grade the control of 
airspeed, element 2B, without knowing what the 
airspeed was indicating during the final approach). 
No differentiation was made between the two 
meanings for this test. It can be seen from Table 6 
that 1,569 out of the total of 7,148 (or 22 
percent) grading possibilities were marked as “not 
observed.”

Fig. 6. Usage of points on 10-point, 7-point, and 4-point scales.

Performance elements 1A through 2C
accounted for 1,354 (or 86 percent) of the total
“not observed.” This was due to the fact that the
maneuver examples available for the test did not
always begin at the study definition of the start of
the maneuver, or an inside view of the airspeed, 
for example, was not given during the performance 
of the maneuver. This latter reason is best shown 
by element 2B. This element was marked 146 
times as not observed for the Final Turn to 
Landing and Lazy Eight maneuvers (showing a 
nonstandard mixture of inside and outside views), 
but was always graded (i.e., no “not observed”) 
for the Vertical S “A” (showing only inside views). 
This illustrates the need for greater coordination 
and standardization between inside and outside 
views obtained during a training mission so as to 
assure capturing a greater percentage of those 
performance elements pertinent to grading a 
maneuver. An alternative would be the incorpora- 



tion of a split-screen capability in an updated 
version of the AVRS. 
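
The two percentages quoted for the “not observed” markings follow directly from the row totals of Table 6. The short check below uses those totals as reported; the dictionary and variable names are illustrative only.

```python
# Row totals from Table 6: times each element was marked "not observed"
NOT_OBSERVED = {
    "1A": 371, "1B": 242, "1C": 232, "1D": 73, "1E": 125, "1G": 68,
    "2B": 146, "2C": 97, "2D": 36, "2E": 51, "2F": 4, "2G": 35,
    "2I": 3, "2J": 12, "4": 28, "5": 10, "9": 32, "Overall": 4,
}
TOTAL_JUDGMENTS = 7148  # grading possibilities, from Table 5

total_not_observed = sum(NOT_OBSERVED.values())

# Elements 1A through 2C: the entry set-up elements plus 2B and 2C
early = ["1A", "1B", "1C", "1D", "1E", "1G", "2B", "2C"]
early_not_observed = sum(NOT_OBSERVED[e] for e in early)

pct_of_all = round(100 * total_not_observed / TOTAL_JUDGMENTS)
pct_early = round(100 * early_not_observed / total_not_observed)
print(total_not_observed, pct_of_all, early_not_observed, pct_early)
```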

Tables 7 and 8 show the degree to which the 
judges, as a group, agreed which elements could 
and which could not be graded on the basis of 
what was shown of a maneuver example on the 
video playback. A 100 percent agreement would 
exist if all judges graded an element or if all
marked it “not observed.” The issue is dichotomous:
either a performance element was graded or
it was marked “not observed.” The grades 
assigned, when assigned, were not considered in 
the development of Tables 7 and 8. In order to 
present a broader view of the agreement figures, 
the columns in both Tables 7 and 8 are divided
into Groups 1, 2, 3, and 4. Group 1 represents 95 
to 100 percent of all the instructors who graded 
any given maneuver or performance element.

Table 6. Number of Times Elements Marked “Not Observed”

Performance             Maneuver
Element          FTL      VSA      L8       Total

1A               163       79      129        371
1B               162       80       -         242
1C               138       -        94        232
1D                -        73       -          73
1E                -        -       125        125
1G                -        68       -          68
2B                71        0       75        146
2C                97       -        -          97
2D                 1        2       33         36
2E                14        1       36         51
2F                -        -         4          4
2G                 6       -        29         35
2I                -        -         3          3
2J                12        0       -          12
4                  9        0       19         28
5                  8        2       -          10
9                 32       -        -          32
Overall            0        0        4          4

Total            713      305      551      1,569

For example, 23 instructors graded each of the 10
examples of the Final Turn to Landing. The
number of instructors in Group 1 would then 
equal 22 to 23. This means that 22 to 23 instruc- 
tors would have recorded a grade or marked an 
element as “not observed” to be considered in 
Group 1. Group 2 represents 90 to 100 percent of 
the instructors; Group 3, 80 to 100 percent; and 
Group 4, 70 to 100 percent. To repeat, the 
percentage is computed on the number of 
instructors grading a maneuver or performance 
element (23 for the Final Turn to Landing, 18 for 
examples 1 through 6 of the Vertical S “A,” 20 
for examples 7 through 10 of the Vertical S “A,” 
and 19 for the Lazy Eight). As the percentage 
range broadens, more instructors are allowed into 
the “area of agreement.” 

Table 7 shows the degree of agreement across 
each of the three types of maneuvers and across all 
maneuvers. To illustrate, the first cell under Group 
1 for the Final Turn to Landing (67 percent) will 
be used. As previously stated, there were 12 
performance elements (the element “overall 
grade” was not included in these computations) to 
be graded for each Final Turn to Landing shown. 
This means that there were 12 x 10, or 120, cells 
to each of which 23 instructors were to either 
record a grade or mark a “not observed.” By 



actual count, 22 or 23 of the instructors agreed in 
80 of the 120 cells (or 67 percent) that they could 
grade or could not grade a given element in a given 
example. Conversely, this also means that two or 
more instructors did not agree with the others as 
to whether a given performance element could or 
could not be graded in 40 of the 120 cells (or 33 
percent). In the same row under Group 4, 16 to 23 
instructors (70 - 100 percent) agreed that they 
could or could not grade an element in 115 of the 
cells (or 96 percent). Except for the Group 1 Lazy 
Eight, this table shows a relatively high rate of 
agreement among the instructors who participated 
in the test as to what could or could not be 
graded. This table also shows that there was a 
greater degree of agreement between instructors in 
a greater number of cells in the Vertical S “A” 
maneuver than the other two maneuvers. 
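The group thresholds and cell counts described above can be sketched as follows. This is only an illustration, not the procedure used in the study: the function names are hypothetical, and the rule of rounding half up to a whole instructor (or whole percent) is an assumption inferred from the figures quoted in the text (22 of 23 instructors for Group 1, 16 of 23 for Group 4, and 80 of 120 cells reported as 67 percent).

```python
import math

# Lower percentage bound of each agreement group, as defined in the text.
GROUP_BOUNDS = {1: 0.95, 2: 0.90, 3: 0.80, 4: 0.70}


def min_agreeing(n_instructors, lower_bound):
    """Smallest number of instructors that counts as agreement.

    Assumption: the product is rounded half up to a whole instructor,
    which reproduces the 22-of-23 and 16-of-23 figures in the text.
    """
    return math.floor(n_instructors * lower_bound + 0.5)


def cell_agrees(n_graded, n_instructors, group):
    """True if enough instructors gave the same response in one cell.

    A response is either a grade or "not observed", so the larger of
    the two counts is the size of the agreeing majority.
    """
    majority = max(n_graded, n_instructors - n_graded)
    return majority >= min_agreeing(n_instructors, GROUP_BOUNDS[group])


def percent_agreement(n_cells_agreeing, n_cells):
    """Percent of cells in agreement, rounded half up to a whole percent."""
    return math.floor(100 * n_cells_agreeing / n_cells + 0.5)


print(min_agreeing(23, GROUP_BOUNDS[1]))  # Group 1, Final Turn to Landing
print(percent_agreement(80, 120))         # first Group 1 cell of Table 7
```

Taking the larger of the "graded" and "not observed" counts reflects the statement that instructors may agree either that an element could be graded or that it could not.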

Table 8 is like Table 7, being based on similar 
computations. However, it is oriented towards the 
elements across all three maneuvers. Since element 
1A is graded in each of the three maneuvers, there 
are 30 cells (three maneuvers times ten examples 
each). Element IB is graded only for the Final 
Turn to Landing and Vertical S “A”; therefore, 
there are only 20 cells in that row. The purpose of 
this table is to show the level of agreement among 



Table 7. Degree of Instructor Agreement by Maneuver

                     Total      Group 1        Group 2        Group 3        Group 4
                    Number      95-100%        90-100%        80-100%        70-100%
                     Cells   IPs Included   IPs Included   IPs Included   IPs Included
Maneuver                          N    Pct       N    Pct       N    Pct       N    Pct

FTL                    120       80     67      90     75     110     92     115     96
VSA, Examples 1-6       60       49     82      51     85      53     88      53     88
VSA, Examples 7-10      40       28     70      29     73      37     93      40    100
L8                     100       47     47      64     64      77     77      89     89
All Maneuvers          320      204     64     234     73     277     87     297     93

N = number of cells in agreement; Pct = percent of total cells.



Table 8. Degree of Instructor Agreement by Element

          Total      Group 1        Group 2        Group 3        Group 4
         Number      95-100%        90-100%        80-100%        70-100%
          Cells   IPs Included   IPs Included   IPs Included   IPs Included
Element                N    Pct       N    Pct       N    Pct       N    Pct

1A           30       17     57      22     73      26     87      28     93
1B           20       14     70      16     80      18     90      18     90
1C           20       12     60      12     60      14     70      14     70
1D           10        4     40       4     40       7     70       8     80
1E           10        3     30       4     40       4     40       7     70
1G           10        6     60       6     60       8     80       9     90
2B           30       17     57      18     60      24     80      29     97
2C           10        0      0       0      0       5     50       7     70
2D           30       25     83      26     87      27     90      29     97
2E           30       19     63      21     70      25     83      26     87
2F           10        9     90      10    100      10    100      10    100
2G           20       11     55      17     85      17     85      18     90
2I           10       10    100      10    100      10    100      10    100
2J           20       16     80      19     95      20    100      20    100
4            30       22     73      25     83      30    100      30    100
5            20       19     95      20    100      20    100      20    100
9            10        1     10       3     30       9     90      10    100

N = number of cells in agreement; Pct = percent of total cells.



instructors relevant to the performance elements. 
Performance element 2C (control of altitude in the 
Final Turn to Landing) is of special interest in that 
there is no agreement among the instructors as to 
their capability to grade or not grade the element 
until Group 3 (or at least 18 of the 23 instructors) 
is reached; then they agree only in half the cells. 
This shows that additional refinements must be 
made to this element in order to raise the degree 
of agreement. Similar analyses can be made for 
other elements in Table 8. Ideally, all the cells 
within Tables 7 and 8 should contain the 100 
percent figure. Since they do not, these tables 
show that there must be additional refinements 
made to the AVRS and its application into the 



UPT program for purposes of both flight training 
and use of the audio-video tapes for evaluating 
levels of performance. A high percentage of 
agreement between instructors upon viewing a 
given example of a maneuver as to what can or 
cannot be graded should be a criterion to be met 
in a redesigned AVRS, if it is to be used for 
evaluating performance levels. Once this criterion 
is satisfied, the problems of re-design and utiliza- 
tion of an AVRS for evaluating levels of 
performance then become associated with the 
degree to which instructor pilots can agree as to 
what level of performance (or grade) is to be 
assigned to a given performance. These problems 
are addressed in the following paragraphs. 




Scale Reliability 

Scale reliability was evaluated from two different 
aspects, not completely unrelated, so as to 
provide a diversity of insights into the pilot 
performance reference scales, as developed, and 
their utilization by experienced instructor pilots. 
One aspect of the evaluation is concerned with a 
series of analyses based upon the results of scale 
usage by combining the judgments of the 23 
instructors, as a group, who participated in the 
test. Included in this evaluation is the use of the 
intraclass correlation. The second aspect of the 
evaluation is concerned more with individual 
instructor judgments using product-moment 
correlations based upon their use of the scales to 
grade duplicate examples of maneuvers purposely 
incorporated in the test efforts. (Details of these 
analyses are available to qualified requesters.) 

Tables 9, 10, and 11 contain a summary of the 
raw data and some computations of the results of 
the judgments of the 23 pilots from grading 
examples of the Final Turn to Landing, Lazy 
Eight, and Vertical S “A,” respectively. The data 
are organized by the three maneuvers, example of 
the maneuver (ten examples for each maneuver), 
and performance elements within each example. 
Using example 1 in Table 9 to illustrate the 
contents of the three tables, the N given for each 
example gives the total number of instructors who 
viewed the example during the tests. The perform- 
ance elements listed in the left-hand column are 
only those applicable to that maneuver. The five 
figures opposite each performance element give 
the mean (the average score of all those recorded 
by the instructors who judged that element), 
variance, the highest and lowest grade assigned by 
the instructors, and the number of instructors who 
actually graded the element. This latter number is 
not always equal to the stated N since one or more 
of the instructors may have checked the “not 
observed” box on the scale. The number of 
instructors who marked a performance element 
“not observed” can be computed by subtracting 
the number who graded the element from the N 
for that example. Also computed and shown in the 
tables are the mean value of all the element means, 
the variance of the element means, the mean value 
of the element variances, and the variance of the 
element variances. The numerical values were 
obtained by assigning the numbers 0 through 9 to 
the 10 scale points from left (U) to right (E). The 



blank cells within each of the tables occur for two 
reasons: first, obviously, is that no grades were 
assigned by any of the instructors who viewed the 
example; secondly, the data were not included if 
less than half of the instructors viewing the 
example recorded a grade for a given element. It 
was felt that recording data from a group of 
responses where less than half of those responding 
agreed that the element could even be graded 
would not be representative of that group. 
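The per-element summary described above (mean, variance, highest and lowest grade, number grading, and the blank-cell rule) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the sample grades are hypothetical, and the use of the population variance is an assumption, since the report does not say which divisor was used.

```python
from statistics import mean, pvariance


def summarize_element(grades, n_viewers):
    """Summarize one performance element for one maneuver example.

    grades    -- scores of 0..9 from the instructors who graded the element
    n_viewers -- the N for the example (instructors who viewed it)

    Returns None (a blank cell) when fewer than half of the viewers
    recorded a grade, following the rule stated in the text.
    """
    n_graded = len(grades)
    if n_graded == 0 or n_graded < n_viewers / 2:
        return None
    return {
        "mean": mean(grades),
        # Assumption: population variance; the report does not say
        # whether n or n - 1 was used as the divisor.
        "variance": pvariance(grades),
        "high": max(grades),
        "low": min(grades),
        "n_graded": n_graded,
        # "Not observed" count = N for the example minus number grading.
        "not_observed": n_viewers - n_graded,
    }


# Hypothetical grades from four of five viewers (not data from the study).
print(summarize_element([4, 5, 5, 6], n_viewers=5))
# Only two of 23 viewers graded the element, so the cell is left blank.
print(summarize_element([3, 4], n_viewers=23))
```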

Figure 7 illustrates the level of performance 
assigned by the instructors to each maneuver 
example. The number used was the mean of the 
overall grade (e.g., 4.35 for example 1, Table 9). 
This value was considered the most valid since all 
instructors who viewed the example (except for 
four instances, one each for four different 
examples) had an input into determining its 
value. The four instances just mentioned were 
inadvertent omissions by the instructors concerned 
and not discovered in time to obtain their 
independent judgments. The example numbers 
which are circled or enclosed by a square in pairs 
in Figure 7 are those examples which were 
repeated on the video test tape. This figure also 
shows that examples 1, 4, and 10 (each different 
examples of the Vertical S “A”) depict essentially 
identical levels of overall performance. An 
inspection, in Table 11, of the data contained 
under the three examples (1, 4, and 10) shows 
differing levels of performance of the elements 
which make up the total maneuver. The data also 
show that the student in example 1 was experi- 
encing the greatest difficulty in element 2J 
(control rate of ascent/descent), the student in 
example 4 element 2D (control of heading), and 
the student in example 10 not any particular 
element but, if anything, element 5 (transitioning) 
was his greatest problem. This analysis is also 
illustrative of the basic concept used in the devel- 
oping of the scales: that of scaling the skills that 
are being taught a student, via performance of 
defined maneuvers, so that acquisitions (rate and 
levels) of given skills can be identified. Assign- 
ments of level of performance to overall maneu- 
vers are not indicative of the problems in skill 
acquisition a student may be experiencing, nor are 
they indicative of similar problems concerning 
identical skills being taught by more than one 
maneuver. 




Table 9. Summary Data of Final Turn to Landing 

Table 10. Summary Data of Lazy Eight

Table 11. Summary Data of Vertical S “A”

Table 12 presents a comparison of the mean 
overall grade, computed from individual assign- 
ments by all the instructors, to the mean grade 
computed from the performance element means. 
As can be ascertained from inspection, in every 
case the mean of the element means is greater than 
the mean of the overall grade. In addition, the 
difference between the two means is less than one 
in all but two of the 25 cases (duplicate examples 
cannot be considered different examples). The two 
cases (example 1 of the FTL and example 2 of the 
VSA) are repeated in their identical counterparts 
(example 5 of the FTL and example 8 of the 
VSA). As can be anticipated from inspection, the 
correlation coefficient for each of the three 
maneuvers is, in fact, extremely high. 
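The comparison can be reproduced for one maneuver with a short sketch. The values below are transcribed from the Final Turn to Landing column of Table 12; the assumption, not stated in the report, is that the coefficient is an ordinary product-moment correlation taken across the ten examples. The function name is hypothetical.

```python
import math

# Final Turn to Landing column of Table 12, examples 1 through 10.
overall = [4.35, 6.26, 4.74, 5.91, 3.30, 4.74, 3.65, 7.35, 5.13, 4.52]
element_mean = [5.53, 6.35, 5.13, 6.18, 4.98, 5.46, 4.33, 7.49, 5.54, 4.98]


def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Mean of the element means minus the mean overall grade, per example.
differences = [round(m - o, 2) for o, m in zip(overall, element_mean)]
r = pearson_r(overall, element_mean)

print(differences)
print(round(r, 3))
```

The printed differences can be checked against the "Difference" rows of Table 12, and the coefficient against the r reported there for the Final Turn to Landing.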

It can, therefore, be stated that the pilot 
performance reference scales in their current state 
of development are highly reliable provided they 
are used and evaluated under conditions similar to 
those described. The most important condition is 



considered to be that of using a level of perform- 
ance obtained by computing the mean of scores 
assigned by an experienced group of at least 12 
instructor pilots. Obviously, such conditions are 
impossible to accept in an operational context and 
the reliability of scales must be demonstrable to 
the level of single instructor usage. 

The consensus of those who evaluate and 
record levels of performance is that not all of the 
factors which affect the final grade are weighted 
equally, nor do all instructors agree as to which 
factors are the most important (or have the 
greatest impact) to the establishment of a final 
grade. For example, one instructor may consider 
airspeed control to be of greater importance to a 
Vertical S “A” than control of the ascent/descent, 
and another may be of the reverse opinion. This is 
not only common knowledge, but the results from 
the initial test and evaluation of the preliminary 
scales, shown in Table 3, are evidence of this 
knowledge. Table 12 represents additional support 






Table 12. Comparison of the Mean of the Element
Means and the Mean Overall Grade

                                   Maneuver
                            FTL        L8         VSA

Example 1    Overall        4.35       6.00       4.50
             Mean           5.53 o     6.15       6.14
             Difference     1.18        .15        .64

Example 2    Overall        6.26       4.67       3.28
             Mean           6.35       5.24       4.36 x
             Difference      .09        .57       1.08

Example 3    Overall        4.74       5.11       3.56
             Mean           5.13 x     5.57 x     3.98 o
             Difference      .39        .46        .42

Example 4    Overall        5.91       5.74       4.56
             Mean           6.18       6.20       5.44
             Difference      .27        .46        .88

Example 5    Overall        3.30       6.26       8.17
             Mean           4.98 o     6.84       8.28
             Difference     1.68        .58        .11

Example 6    Overall        4.74       6.21       3.28
             Mean           5.46       6.68       4.07
             Difference      .72        .47        .79

Example 7    Overall        3.65       4.00       6.10
             Mean           4.33       4.54       6.61
             Difference      .68        .54        .51

Example 8    Overall        7.35       2.89       2.90
             Mean           7.49       3.49       4.81 x
             Difference      .14        .60       1.91

Example 9    Overall        5.13       4.58       3.80
             Mean           5.54 x     4.95 x     4.28 o
             Difference      .41        .37        .48

Example 10   Overall        4.52       5.00       4.50
             Mean           4.98       5.85       4.83
             Difference      .46        .85        .33

r                           .939*      .986*      .934*

Note.
*p < .001
o identical examples
x identical examples

in that in all cases there is a difference between the 
mean overall grade and the mean of the element 
means. This indicates that some performance 
element or elements are being weighted more (or 
perhaps some a great deal more and some much 
less) than others. Example 8 (and its identical 
example 2) of the Vertical S “A” and example 5 
(with its identical example 1) of the Final Turn to 
Landing appear to be the most likely candidates 
for obtaining initial data as to what weights 
instructors actually applied to the elements within 
those examples to arrive at a final overall grade so 
different from the straight mean of the element 
grade. Such an analysis would not ignore all other 
examples, but those mentioned would be a logical 
start point. This observation demonstrates the 
necessity for obtaining further insights into what 
elements are most important to a maneuver, in 
terms of a weighting factor as well as the impor- 
tant skills (or essential elements of performance) 
to be learned, in future development efforts to 
revise and refine the pilot performance reference 
scales presented here. 
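The effect of unequal weighting can be illustrated with a small sketch. Everything in it is hypothetical: the element grades, the weights, and the function names are invented for illustration and are not data or methods from this study. It shows only how an overall grade can fall well below the straight mean of the element grades when one poorly performed element is weighted heavily.

```python
def straight_mean(element_grades):
    """Unweighted mean of the element grades for one example."""
    return sum(element_grades.values()) / len(element_grades)


def weighted_mean(element_grades, weights):
    """Mean of the element grades using one weight per element."""
    total_weight = sum(weights[e] for e in element_grades)
    weighted_sum = sum(g * weights[e] for e, g in element_grades.items())
    return weighted_sum / total_weight


# Hypothetical element grades for one Vertical S "A" example (0-9 scale).
grades = {"2B": 6.0, "2D": 5.0, "2E": 5.0, "2J": 2.0}
# Hypothetical weights: control of rate of ascent/descent (2J) is
# assumed to count three times as much as each other element.
weights = {"2B": 1.0, "2D": 1.0, "2E": 1.0, "2J": 3.0}

print(straight_mean(grades))
print(weighted_mean(grades, weights))
```

Recovering the weights that instructors actually applied would require the element-level grades of the individual instructors, which is the kind of further analysis suggested above.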

The analysis of data to this point has been 
concerned mainly with scale reliability based upon 
considerations of their overall use. Since the major 
development thrust of the pilot performance refer- 
ence scales was oriented towards performance 
elements and a scale specific to each element and 
maneuver, an analysis of these scales was made. 
The primary measure of scale reliability is in terms 
of the variance of instructor pilot judgments. 
Under the basic assumption specified at the 
beginning of Section III, a high variance would 
indicate a low degree of scale reliability, and a low 
variance a high degree of scale reliability. The basic 
problem, however, is defining what constitutes 
“high” variance and “low” variance. Although it 
appears logical to select a variance of 2.25 (derived 
from assuming a normal distribution of scores with 
at least two-thirds of the scores falling within ±1 
unit on a 10-point scale) for purposes of defining 
“high” and “low” reliability relative to the scales, 
it was considered premature, at this stage in scale 
development, to do so and that the data ought to 
suggest what the factors of scale reliability deter- 
minations should be. Therefore, variances (or scale 
reliability) were analyzed only on a relative basis. 
Results of these analyses, which are detailed in a 
supplementary appendix (available to qualified 
requesters), are reflected in the conclusions 
presented in the following section. 



IV. CONCLUSIONS AND RECOMMENDATIONS 

The following conclusions and recommenda- 
tions are made based upon the result of the test 
and evaluation, and analysis of the data therefrom, 
of the pilot performance reference scales de- 
veloped during this study. The scales, as given in 
Appendix II, are specific to each of three 
maneuvers included in the flight training syllabus 
of the Air Force UPT program: the Final Turn to 
Landing, Lazy Eight, and Vertical S “A.” Each 
maneuver consists of 10 or 12 (depending upon 
the maneuver) performance elements individually 
scaled with verbalizations describing four levels of 
performance equally spaced along a dimension 
consisting of 10 possible points for discrimina- 
tions. 

Conclusions 

The following conclusions are made relative to 
the scales, number of discriminations, and scale 
reliability. 

Pilot Performance Reference Scales 

1. There are no bases, from official documentation, 
upon which to make judgments as to what a 
scale is truly measuring (i.e., determining scale 
validity). 

2. Although a single instructor per session 
would have guaranteed independence of judg- 
ments, such independent judgments were obtained 
during the test effort with multiple judges. 

3. The inconsistent and unpredictable switch 
between the outside and inside views from the 
video replays (which was not only disturbing to 
the viewer but did not depict, at times, the infor- 
mation needed at a particular moment) and the 
somewhat less than satisfactory quality of some of 
the video replays, affected the judgments made by 
the instructors to some undeterminable extent. 

4. There was a high rate of agreement between 
the instructors as to what performance element 
could or could not be graded for any given 
maneuver example. The greatest area of agreement 
was with those elements of the Vertical S “A.” 
Performance element 2C (control of altitude) in 
the Final Turn to Landing provided the greatest 
area of disagreement. 

5. The pilot performance reference scales, as 
developed, identify specific problem areas (and the 
extent thereof) being experienced by a student 
during his training not obtainable from a 
maneuver-oriented overall grade. 

6. The grade obtained from the mean grade 
assigned to each of the performance elements of a 
given maneuver is highly predictive of the mean of 
the overall grades assigned. 

7. The results of use of the scale by experi- 
enced instructors showed that these instructors 
were in high agreement as to what constitutes 
exemplary performance but that the grades result- 
ing from performances to varying degrees less than 
exemplary were more variable. 

8. The Lazy Eight maneuver was the most 
difficult to grade (and the most variable) and the 
Vertical S “A” the easiest (and the least variable). 

9. When duplicate maneuver examples were 
shown during a test session, the instructors were 
highly consistent (90.3 percent) in their judgments 
as to whether an element could or could not be 
observed from the video replay. 

10. When the same maneuver example was 
shown on two separate occasions, about one 
month apart, the judgment as to whether an element 
could or could not be observed from the video 
replay was a relatively stable one, with the degree 
of stability influenced by the nature of the maneu- 
ver, the particular maneuver example, and the 
instructor making the judgment. 

11. The reliability of overall scale usage and 
some of the individual performance element scales, 
the probability of a high intelligence gain from a 
more detailed in-depth analysis of the available 
data, and the concepts and principles upon which 
the scales were developed, suggest that continued 
development of the pilot performance reference 
scales by expansion, revisions, and refinements 
would provide a product useful to the flying 
training program of the Air Force. 

Discriminations 

1. All points on the 10-point scale were used 
during the evaluation and such usage was 
reasonably normally distributed with scale points 
6 (for the Final Turn to Landing and Lazy Eight) 
and 5 (for the Vertical S “A”) being those repre- 
senting the greatest usage. 

2. Results from grading identical examples 
after a period of time has elapsed suggest that 
scales which require a relatively small number of 



discriminations (four in this case) are fairly 
reliable. 

Reliability 

1. The pilot performance reference scales as 
developed are highly reliable when used with the 
AVRS and the mean scores of a group of at least 
12 experienced instructors are used to record 
levels of performance. 

2. The individual performance element refer- 
ence scales are considered to be of relatively high, 
medium, and low reliability as follows: 

Final Turn to Landing 

High: Altitude at start (1B); Control of airspeed 
(2B). 

Medium: Airspeed at start (1A); Control of 
pitch angle (2E); Control of angle of bank 
(2G); Control of rate of ascent/descent (2J); 
Error correction (4); Transitioning (5). 

Low: Attitude at start (1C); Control of altitude 
(2C); Control of heading (2D); Use of 
ground reference points or lines (9). 

Vertical S “A” 

High: Airspeed set up on entry (1A); Heading 
set up on entry (1D); Control of airspeed 
(2B); Control of heading (2D); Error correction 
(4); Transitioning (5). 

Medium: Altitude set up on entry (1B); Trends 
on entry (1G); Control of pitch angle (2E); 
Control of rate of ascent/descent (2J). 

Low: None. 

Lazy Eight 

High: Airspeed at entry (1A); Position of 
aircraft at entry (1E); Control of airspeed 
(2B). 

Medium: Attitude at entry (1C); Control of 
angle of bank (2G); Control of rate of pitch 
change (2I). 

Low: Control of heading (2D); Control of pitch 
angle (2E); Control of rate of roll (2F); 
Error correction (4). 

3. The mean results from another sample of 
experienced instructors, who viewed and graded 
the same maneuver examples, would correlate 
highly with those obtained in the present study. 

4. Based upon use of the scales in grading 
identical maneuver examples of the Final Turn to 
Landing and Vertical S “A” after a time lapse of 
approximately one month, the relative reliability 
of the performance element scales is as follows: 



Final Turn to Landing 

High: Control of pitch angle (2E); Control of 
rate of ascent/descent (2J); Error correction 
(4); Overall grade. 

Low: Control of airspeed (2B); Control of 
heading (2D); Control of angle of bank (2G); 
Transitioning (5). 

Inconclusive Data: Control of altitude (2C); 
Use of ground reference points or lines (9). 

Vertical S “A” 

High: Airspeed set up on entry (1A); Altitude 
set up on entry (1B); Heading set up on 
entry (1D); Control of airspeed (2B); 
Control of pitch angle (2E); Control of rate 
of ascent/descent (2J); Error correction (4); 
Overall grade. 

Low: Trends on entry (1G); Control of heading 
(2D); Transitioning (5). 

5. Based upon use of the scales in grading 
identical maneuver examples of the three study 
maneuvers when viewings were separated by 
viewing other examples and a time lapse of about 
30 minutes, the relative reliability of the performance 
element scales is as follows: 

Final Turn to Landing 

High: Airspeed at start (1A); Use of ground 
reference points or lines (9). 

Medium: Altitude at start (1B); Control of 
heading (2D); Control of pitch angle (2E); 
Control of rate of ascent/descent (2J); 
Control of angle of bank (2G); Overall grade. 

Low: Attitude at start (1C); Control of airspeed 
(2B); Error correction (4); Transitioning (5); 
Control of altitude (2C). 

Vertical S “A” 

High: Control of airspeed (2B). 

Medium: Control of heading (2D); Control of 
pitch angle (2E); Overall grade. 

Low: Control of rate of ascent/descent (2J); 
Error correction (4); Transitioning (5). 

Incomplete Data: Airspeed set up on entry 
(1A); Altitude set up on entry (1B); Heading 
set up on entry (1D); Trends on entry (1G). 



Lazy Eight 

High: Control of heading (2D); Error correction 
(4). 

Medium: Control of rate of roll (2F); Control 
of rate of pitch change (2I). 

Low: Control of pitch angle (2E); Control of 
angle of bank (2G); Overall grade. 

Incomplete Data: Airspeed at entry (1A); 
Attitude at entry (1C); Positioning of aircraft 
at entry (1E); Control of airspeed (2B). 

Recommendations 

On the basis of the conclusions and the overall 
report of the development, test, and evaluation of 
the pilot performance reference scales, the fol- 
lowing recommendations are made: 

1. That the suggested additional analyses as 
outlined in Appendix III of the report be 
accomplished. 

2. That the scales developed during this study 
be refined and revised on the basis of the results of 
their evaluation as given in this report (and 
supplemented by the additional recommended 
analysis) and the scope of the scales be expanded 
to include all the maneuvers (or pilot skills) taught 
during UPT. This recommendation does not 
necessarily encompass a requirement that the 
scales (or performance elements) be 
maneuver-oriented. 

3. That reliable (and valid to the greatest 
degree possible) pilot performance reference scales 
be developed prior to additional efforts to evaluate 
the usefulness of the AVRS as a tool for deter- 
mining levels of performance. 

4. That the results of this study (and the 
recommended additional analysis) be used as 
inputs into the design specifications for an 
updated AVRS responsive to studies for projected 
utilizations (or studies of different possibilities) 
within the flying training programs of the Air 
Force. 

APPENDIX I 



PRELIMINARY PERFORMANCE ELEMENTS OF SELECTED MANEUVERS 




Maneuver: Normal Landing Pattern    Maneuver Segment: Pitch out 

Altimeter 1000' above terrain Altimeter Pointer At correct altitude None Yes 

(2300' for Vance) 

A/S 200 kts. at start A/S ind. Pointer At M 2* f on indicator None Yes 



Appendix I (Continued) 

Activity or Condition   Indicator or Stmt   Indication or Stimulus   Decision Factors   Performance Criteria   TV System Capability   Criticality of Performance 



/ 



u 

> 



O 

z 



u 

> 



O 

z 





- 


ai 




** 






mm 






c 


— 


g C -3 

e 1J i 


c 

F J 


C 

‘J 

E 




c 

V 


C 

if 




C 

V 






V 

E 

sc 


c 

y 

E 


d-§i 


None 

IPjiidtfi 


S4 

“3 


'J 


% 

•3 


o 

“3 




SC 

*3 


zj 


s> 


-3 

SC 


S4 

*3 


3 3 

.«■ « ft 
» V C 

sc 5 


3 

flu 


c 

o 

Z 


3 

&. 




C 

w 


flu 


*3 


c: 

O 

Z 


3 

flu 


3 

flu 



v 

c. 

o * 

5 . o .j 

4 - *3 S 5 

§■= 

eta 

fc gco 

igja 

u u Zi c* 
w -3 -c a 
jF c » « 
U 3 = a 



2.2 



<30. 



3 

O 

*3 



e 

o 



g 

r c Is £ ‘t 

5 r /J 3 t. 

F £ 5 

ij o 3 3 c >v 
£ -c 




o 

&. 



3 



o 

r« 



J 'J e- 

e* > s 5 

„ r c o n 

x « ~ “3 

w w n a c 

: 'i *3 r £ ~ 

c c*o ; J ° o 

5 ^ o j - 

~ o 3?, v 3 

o 8 = o -3 

- 3 -3 s- 5 z 2 



C 



O “ 

-S 'J 



.£> 

-5 



o 

£ ^ 
* © 
o £ 



4 1 

£ 3 U) o 

W gf 

O 3 „ r~> 

c Jt: c 

§ S c-g c 

° 



a. 

15 



2.2 22 



tr 

■j 

c 



o — . 
*— rs 
O. 



3 v. 



CU 

rt 

c: _ 

<— w -5 

o 

*. J 3 
C t* 
- » C 



£ 

o 

a. 



ZZ 1 ^ 



o 

flu 



p » 5 £ S cfc 
dc o: C <£ E J 2 



*3 

£ 


-3 £^-3 
^ g J£ 


*3 

£ 


a 


*3 — : 

•Sis 






{/5 


3 

sr 


§■ 5 * 2 


< 


oS < O 


< 


> 


— < ;» 



. o 
! £ 



-3 

C 



C 

o 

%> 

o 



? *3 
O 3 
cu.t: 

Cm ~ 

O a 

J! 

3 a 



O 
n 



Is 

Cu ~ 

w 3 



JS *3 
?3 TT 



.3 

f 3 



= t= 

c o — 
? - -* 

o-5 ■§ §S 

5 = lie 

s« f. §1 

c. — — 



*3 d- 
•» sc 

22 
w « 
w « 

as 



- S 5 - 

c 5 a 

SC, o — 
c y CO 

=£ £< 



i 

u 



as 

e 

*3 

§ 

*4 

•3 

E 

o 

Z 



— n 



o 

Z 



+\ n 

o o 

>- > 



-3 C 

£ «Z 

~ o 



o 

Q. 



.C 

SC 



tc 

c 



% id 5 

— 'J <— • 



o 

c 



£ *. 
n * 
c 
*3 
C 



E- O C 

£ a 3 

c 



o 

o 



w O 



"3 

O 

c 

h » y r * 5 * W O 

« « J£ £ - w. 

2 o S 2 E rf ® 

§ 1 S| 2> | 821 

2 5 S < < Eo> -a f-= § 



£ £ £ 



^ £ x» ^ *. 2 

^ g o « £ £ 

^ n ?S | | 

3 O 

cu a. 



S o s' 
» E a: 



c 

rs 

JO 

sc 

c 



-3 

c 



— c 

f3 — 

3 * 

sr, *■» 

> < 



£ 


3 

£ 


£ *3 

E • 


3 

£ a 


*3 


CO 


«> 


£ S 


< 


< 


<> 


<> 



C 

rs 



= 

u 3 

u l> 

£ o 

sr 

a 5 



3 



« 

« 

E 



— x: 

« 

SC O C 

- 5 . | 

r\ 



*3 

C 

y 






C 

o 



a 



rs 

y 



cn 



1 !§■ 

sc“ 

c “i- 

2 § S 

|°o 2 

< CN 3 



c 

3 

-^O 

Cm (O 

o 

£ 

H 



u 

ERIC 



40 



48 



Appendix I (Continued) 




to idle

At point of touchdown        V.V. ind.     Pointer moving to "0"        At "0"                       At "0"         Yes

After touchdown nose wheel   Att. ind.,    Aircraft moving from above   Lowered gently and started   IP judgment    Yes
lowered to runway            Visual        bar to below bar             at right moment
APPENDIX III: SUGGESTED ADDITIONAL ANALYSES 



This appendix is concerned with the identification of areas of data analysis considered pertinent to 
continued efforts in the development of pilot performance reference scales. As with most studies of this 
type where there is a great deal of raw data to be processed, there are many statistical approaches which can 
be pursued. In addition, the results of one analysis usually suggest several other potential analyses. Of 
course, not all possible analyses are useful; in this section every attempt has been made to include only 
those analyses thought to be the most useful to further developments of pilot performance reference scales 
or utilization of an AVRS during pilot training or retraining. The areas of data analyses are concerned 
primarily with a greater in-depth analysis consistent with a determination as to why certain inconsistencies 
exist in the data already presented in the report and with additional data analyses thought to be relevant to 
the overall objectives of the Air Force’s flying training programs. Specific objectives of the analyses are (a) 
to provide additional insights and guidance into pilot performance scales prior to further scale development 
efforts; and (b) to provide a greater understanding of the factors which impact on decisions or judgments 
made by instructor pilots relevant to the assignment of level of performances. 



There were many inconsistencies in the data presented in the main body of the data analysis. An 
in-depth analysis of these inconsistencies would provide a greater understanding of the actual usage of the 
pilot performance reference scales developed in this study and greater insights into the reasons why those 
scales were not as reliable as envisioned in order to develop rationales for their improvements (ergo, 
reliability). Such analyses must be accomplished in conjunction with a study of the video replay of the 
maneuver examples involved in the inconsistency. The following suggested analyses are considered to be 
minimal: 



1. Elements marked observed-not observed: The data suggest that the difference in opinions as to 
what elements could and could not be graded should be investigated in greater depth than has already been 
accomplished (and reported). For example, in a particular series of observations, only 9 of the 23 
instructors who viewed the maneuver graded an element. Why? Or why did 14 instructors not grade the 
element? All such differences should be considered in the additional data analysis. 

2. Elements with high variances or a high spread of assigned performance levels should be analyzed 
in depth in order to ascertain what the primary problems might be. There are 19 such instances of high 
element variance which represent the initial concentration of effort. 

3. Several inconsistencies require a greater in-depth analysis in consonance with the stated objectives 
of this section. Investigations should be undertaken to determine why certain maneuver examples contain 
both high and low variability elements (example 2 of the Final Turn to Landing, examples 2, 4, and 8 of 
the Vertical S “A,” and example 7 of the Lazy Eight). 

Another area for study concerns examples 1 and 5 (identical performances), which appear to present 
a unique problem to the scales. The only consistency appears to be with elements 1C and 2D which show 
high variabilities in both examples. Element 2E is most unique in that it is highly variable in example 1 and 
is among those with the lowest variability in example 5. 

The following elements are listed among those with the highest relative variance and the lowest 
relative variance: elements 2E, 2G, and 4 of the Final Turn to Landing; elements 1C, 2G, and 2I of the 
Lazy Eight; and element 2E of the Vertical S “A”. The fact that the levels of performance assigned to these 
elements were determined from their demonstration within different examples of maneuvers (i.e., within 
different contexts) enhances the expected intelligence gain from a greater in-depth analysis. 

4. Examples 5 and 8 of the Final Turn to Landing are on opposite ends of the scale, and these two 
examples have the lowest variability of all the Final Turn to Landing examples. This situation exemplifies 
an ideal result if the scales were found to have been reliable. (In fact, such a situation is a measure of scale 
reliability.) Unfortunately, it is known that the scales are not that reliable. In conjunction with an analysis 
of the apparent reliability of the scales in judging the two Final Turn to Landing examples should be an 
analysis of two examples of the Vertical S “A” which are on opposite ends of the scale (examples 5 and 8), 
further apart, but with greater degree of variability. Examples 5, 6, and 1 (taken as a unit) and example 8 of 
the Lazy Eight should also be included in the analysis, but to a lesser extent since the difference between 
the two performance levels is less than the difference between the extremes in the other two maneuvers. 



5. In the Vertical S “A”, a difference of 1.28 was shown in the variances of the overall grades 
assigned to examples 3 and 9 (identical performances). This is the largest difference between all duplicate 
examples. Yet the Vertical S “A” has been found not only to be the easiest maneuver to make judgments 
about, but that, in general, the variabilities of these judgments are the smallest. It was also noted that 
examples 2 and 8 of the Vertical S “A” had zero difference in variability. Again, an in-depth analysis should 
provide necessary pertinent inputs into the overall development of reliable scales. 

6. It is felt that the performance elements concerned with the set-up of a maneuver prior to 
maneuver performance (elements 1A through 1G) received too great an emphasis during the data analysis. 
There is no doubt that proper set-up is a prerequisite to good maneuver performance and a skill to be 
learned. However, scaling each part of the overall set-up requirements (e.g., airspeed, altitude) and treating 
each part as an equal to elements concerned with maneuver performance is not considered to be 
appropriate. Therefore, it is suggested that the set-up elements and associated data be combined as a single 
element and that selected analyses similar to those already reported be repeated. The results from the two 
sets of analysis should then be compared as to impact on scale reliability and the scales themselves. 

7. The basic data bank contains information as to what actual syllabus mission each of the 25 (30 
minus duplicates) maneuver examples represented. The name of the student is also available. It is suggested 
that the levels of performance assigned by individual instructors and the instructors as a group be analyzed 
in conjunction with the syllabus mission flown relative to the overall syllabus. In the event the data analysis 
so suggests, the name of the student (and, thus, his flight training record) will provide additional data to the 
data analysis effort. 

8. The scale developed in this study was a 10-point scale. The data analysis was, therefore, oriented 
towards establishing results relevant to that scale’s discriminatory properties as well as its reliability. 
Although some analysis was accomplished, it is suggested that the raw data be translated into 4-point and 
7-point scales (method for so doing previously explained). Selected analyses similar to those which used the 
10-point scale should then be repeated and the results compared and analyzed as they affect the number of 
discriminations that can reliably be made and the scale reliability. 

9. Further insights into the number of discriminations an instructor can make for each maneuver 
may be available from the basic data. At least one additional analysis seems to be indicated, and that is 
determining the number of discriminations that were made by the test instructors for each performance 
element both within the three types of maneuvers and across maneuvers. The same analysis should be 
accomplished, if the data are translated into 7- and 4-point scales, and a comparative analysis of the results 
made. 

10. The data obtained during this study contain insights into the process wherein evaluations of 
performance elements are integrated into a composite, overall evaluation of a maneuver example. At 
present, there is little doubt that this process, or formula, used by an instructor to arrive at an overall grade 
for a maneuver remains a mystery, and that there are probably as many formulas as there are instructors. 
Nonetheless, it is reasonable to assume that such formulas exist and that such formulas (including those 
which have different applications, since, for instance, the needs of an instructor differ in detail from the 
needs of upper management) are basic to the future requirements of automated systems. It is suggested that 
this area of analysis should be accomplished as an initial step in the development of a formal mathematical 
model of how instructors grade, or should grade, maneuvers. 
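
The rescaling and discrimination counts suggested in items 8 and 9 above can be sketched as follows. This is only an illustration under stated assumptions: the report's own translation method is not reproduced here, so the proportional, round-up mapping in `rescale` is an assumed stand-in, and the function names and sample grades are hypothetical.

```python
def rescale(level, old_max=10, new_max=7):
    """Map a level on a 1..old_max scale onto a 1..new_max scale.

    ASSUMPTION: equal-width proportional bins, rounded up. This is a
    stand-in, not the translation method described in the report.
    """
    if not 1 <= level <= old_max:
        raise ValueError("level outside the source scale")
    # Integer ceiling of level * new_max / old_max (no floating point).
    return -(-level * new_max // old_max)


def discriminations(levels):
    """Count the distinct performance levels actually assigned."""
    return len(set(levels))


# Hypothetical 10-point grades assigned by eight instructors to one element.
grades_10 = [3, 4, 4, 5, 7, 8, 8, 10]
grades_7 = [rescale(g, 10, 7) for g in grades_10]
grades_4 = [rescale(g, 10, 4) for g in grades_10]

for name, grades in (("10-point", grades_10),
                     ("7-point", grades_7),
                     ("4-point", grades_4)):
    print(name, grades, "distinct levels:", discriminations(grades))
```

Comparing the distinct-level counts across the three scales gives a rough first look at how many discriminations survive when the scale is coarsened; a reliability comparison would still need the variance analyses described elsewhere in this appendix.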

These suggestions do not constitute the entire range of data analysis which could be conducted with 
the available data, nor are they meant to be restrictive. However, they do reflect the conviction of the study 
group that the concepts and principles used to develop the scales in this study were valid and that 
additional efforts to gain further insights which would broaden the base upon which scale revisions, 
refinements, and expansions could be effected are warranted. 
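
As an illustration of the first two suggested analyses (elements marked observed or not observed, and elements with high variance), a minimal sketch follows. The data layout, the use of `None` for "not observed," the element grades, and the function names are all assumptions made for the example; only the 9-of-23 figure comes from the text above.

```python
from statistics import variance


def grading_rate(grades):
    """Fraction of instructors who graded the element.

    ASSUMPTION: an ungraded ("not observed") entry is stored as None.
    """
    graded = [g for g in grades if g is not None]
    return len(graded) / len(grades)


def element_variances(table):
    """Sample variance of assigned levels per element, skipping None."""
    result = {}
    for element, grades in table.items():
        graded = [g for g in grades if g is not None]
        if len(graded) >= 2:  # variance needs at least two grades
            result[element] = variance(graded)
    return result


# The case cited in item 1: 9 of 23 instructors graded the element.
# The grade values themselves are hypothetical.
cited = [5] * 9 + [None] * 14
print(round(grading_rate(cited), 3))

# Hypothetical grades for two elements from five instructors.
table = {"1C": [2, 9, 4, 8, None], "2E": [5, 5, 6, 5, 5]}
variances = element_variances(table)
# Rank elements from highest to lowest variance for follow-up review.
ranked = sorted(variances.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Elements at the top of the ranking, or with a low grading rate, would be the first candidates for review against the video replay.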